diff --git a/.claude/skills/review-pr/SKILL.md b/.claude/skills/review-pr/SKILL.md new file mode 100644 index 00000000..813d9007 --- /dev/null +++ b/.claude/skills/review-pr/SKILL.md @@ -0,0 +1,377 @@ +--- +name: review-pr +description: Review a GitHub pull request against py-questdb-client (Cython + C-ABI) coding standards +argument-hint: [PR number or URL] [--level=0..3] +allowed-tools: Bash(gh *), Bash(git *), Read, Grep, Glob, Agent +--- + +Review the pull request `$ARGUMENTS`. + +## Review mindset + +You are a senior QuestDB engineer performing a blocking code review. `py-questdb-client` is mission-critical software: a **Cython** extension that wraps the **`c-questdb-client` (Rust) library** through its **C ABI**, and is used to ingest production data from customer Python applications. A bug here causes data loss, silent data corruption, segfaults that take down the host Python interpreter, reference-count leaks, or native memory leaks. There is zero tolerance for correctness issues, memory unsafety, refcount imbalance, GIL violations, or an FFI binding that disagrees with the C header it calls. Be critical, thorough, and opinionated. Your job is to catch problems before they ship, not to be nice. + +- **Assume nothing is correct until you've verified it.** Read surrounding code to understand context — don't just look at the diff in isolation. +- **The diff is a hint, not the boundary of the review.** The highest-value bugs almost always live at callsites outside the diff that depend on contracts the diff quietly changed (a `cdef` helper's error-return convention, a buffer's ownership, a `qdb_pystr_buf` arena's lifetime). Treat the diff as the entry point, not the scope. +- **Flag every issue you find**, no matter how small. Do not soften language or hedge. Say "this is wrong" not "this might be an issue". +- **Do not praise the code.** Skip "looks good", "nice work", "clever approach". Focus entirely on problems and risks. 
+- **Think adversarially.** For each change, work through: + - Inputs: which values break this? Empty buffers, zero-length strings, `None`, NaN/inf floats, boundary integers (`INT64_MAX`/`INT64_MIN`), max-length symbols, non-UTF-8 `str`, `bytes` with embedded NULs, huge `int` that overflows `int64_t`. + - Encoding: how does the code behave when a Python `str` contains lone surrogates, astral codepoints, or characters that fail UTF-8 encoding? + - Memory: every `malloc`/`calloc`/`realloc` — is it freed on the error path, the exception path, and the early-return path? Every `Py_INCREF` — is there a matching `Py_DECREF`? Every `PyObject_GetBuffer` — a matching `PyBuffer_Release`? + - GIL: does a `with nogil` block touch a Python object or call a CPython API function? Does a `cdef ... nogil` function need the GIL it doesn't hold? + - Failure modes: connection dropping mid-flush, partial write, TLS handshake failure, auth rejection, server rejection — does the buffer/sender end in a usable state, and does native memory get released? + - C-ABI callers: what happens when a C function returns `NULL`, returns an error via its out-param, or hands back a pointer the Cython side must free exactly once? +- **Check what's missing**, not just what's there. Missing tests, missing error handling, missing edge cases, missing `ingress.pyi` stub updates for public API changes, `.pxd` declarations out of sync with the C header. +- **Verify every claim.** If the PR title says "fix", verify the bug actually existed and the fix is correct. If it says "improve performance", look for benchmarks or reason about the change against the per-row hot path. If it says "simplify", verify the new code is actually simpler and doesn't drop behavior (e.g. a dropped `free` on an error branch). Treat the PR description as an unverified hypothesis. +- **Read the full context of changed files** when the diff alone is ambiguous. Use Read/Grep/Glob to inspect surrounding code, callers, and related tests. 
+- **Assess reachability before reporting.** For every potential bug, trace the actual callers and inputs. If a problem requires physically impossible conditions (a length larger than `SIZE_MAX`, a NUL injected through an API that already rejects it, a panic behind a validation guard), it is not a real finding — drop it. Focus on bugs that real workloads can trigger, not theoretical edge cases. +- **Never review generated or build artifacts.** `src/questdb/ingress.c`, `*.html` (Cython annotation), and `*.so` are build outputs. The source of truth is `*.pyx`, `*.pxi`, `*.pxd`, and `*.pyi`. If the diff contains a regenerated `ingress.c`, review the `.pyx`/`.pxi` change that produced it, not the generated C. + +## Review level + +Parse `$ARGUMENTS` for a level token: `--level=N`, `-lN`, or a bare single digit `0`-`3`. **If no level is given, default to 0.** Strip the level token before feeding the remainder (PR number or URL) to `gh` commands. + +The level controls how much of the review below actually runs. Lower levels keep the same review *spirit* — adversarial, blocking, no praise — but cut the breadth of the analysis. Higher levels have significantly higher token cost; reserve level 3 for high-stakes PRs (C-ABI `.pxd` changes, a `c-questdb-client` submodule bump, the dataframe/Arrow ingestion path, `nogil` sections, manual `malloc`/refcount code, ILP wire format, or auth/TLS configuration). + +| Level | What runs | +|-------|-----------| +| **0 (default)** | Steps 1, 2, 4. Skip Step 2.5. Skip Step 3 — no agent spawn; review the diff inline in the main loop, using Read/Grep on demand to resolve ambiguities. Skip Step 3b — verify each finding inline as you write it. Single-pass review covering correctness, Cython memory/refcount/GIL safety, C-ABI binding correctness, tests, and coding standards on the diff itself. | +| **1** | Adds Step 2.5a (semantic delta only — skip 2.5b/2.5c/2.5d). 
In Step 3, launch only Agent 1 (correctness), Agent 2 (Cython memory & refcount safety), and Agent 7 (tests) in parallel. Skip all other agents. Skip Step 3b — verify findings inline as you draft the report. | +| **2** | Full Step 2.5, but in 2.5b restrict the callsite inventory to public Python symbols (exported in `__all__` / `ingress.pyi`) plus every `cdef`/`cpdef` function and every C-ABI symbol declared in the `.pxd` files. In Step 3, launch Agents 1-8. Skip Agent 9 (cross-context) and Agent 10 (adversarial fresh-context). Step 3b uses a single batched verification agent for all findings instead of one per finding. | +| **3** | Every step below as written, all 10 agents, per-finding verification. The full mission-critical pass. | + +State the chosen level in one line at the start of the review so the user knows what they're getting (e.g., "Reviewing PR #141 at level 2"). If the level was defaulted, mention that level 3 exists for full review. + +## Step 1: Gather PR context + +Capture the PR identifier in `$PR` (the part of `$ARGUMENTS` left after stripping the level token), then fetch metadata, diff, and review comments in a single bash call so `$PR` is in scope for all three `gh` invocations: + +```bash +PR='<PR number or URL>' +gh pr view "$PR" --json number,title,body,labels,state +gh pr diff "$PR" +gh pr view "$PR" --comments +``` + +If the diff modifies `c-questdb-client` (the git submodule pointer) or any `.pxd` file, note it now — a submodule bump or binding change is the highest-risk class of change in this repo and forces level-3 scrutiny of the C-ABI surface regardless of the requested level. 
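The level-token rule above (`--level=N`, `-lN`, or a bare digit `0`-`3`, defaulting to 0) can be sketched as a small parser. This is a minimal illustration, not part of the skill: the function name `split_level` is hypothetical, and the tie-break for a lone bare digit (treated as the PR number, since nothing else would be left for `gh`) is an assumption the source does not state.

```python
import re

# Explicit level tokens: --level=N or -lN, with N in 0..3.
_EXPLICIT = re.compile(r"(?:--level=|-l)([0-3])")
# A bare single digit 0..3 may also be a level token.
_BARE = re.compile(r"[0-3]")


def split_level(arguments: str) -> tuple[int, str]:
    """Return (level, remainder); remainder is the PR number or URL."""
    level = 0          # default when no level token is given
    found = False
    rest = []
    for token in arguments.split():
        match = _EXPLICIT.fullmatch(token)
        if match:
            level, found = int(match.group(1)), True
        else:
            rest.append(token)
    # Assumption: a bare digit counts as a level only if another token
    # remains to serve as the PR identifier.
    if not found and len(rest) > 1:
        for i in range(len(rest) - 1, -1, -1):
            if _BARE.fullmatch(rest[i]):
                level = int(rest.pop(i))
                break
    return level, " ".join(rest)
```

For example, `split_level("141 --level=2")` yields level 2 with `"141"` left over for the `gh` calls, while `split_level("3")` keeps `"3"` as the PR number at the default level.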
+ +## Step 2: PR title and description + +Check: +- Title is clear and describes the change +- Description speaks to end-user impact, not implementation internals +- If fixing an issue, `Fixes #NNN` or a link to the issue is present +- Tone is level-headed and analytical +- For public API changes (anything in `__all__`, a new/changed method on `Sender`/`Buffer`/`Client`, a new keyword argument, or a changed default), the description calls out the API change explicitly, and `CHANGELOG.rst` is updated +- For a `c-questdb-client` submodule bump, the description states which upstream change is being pulled in and why + +## Step 2.5: Map the change surface + +Before launching review agents, produce a structured change surface map. This step is mandatory and must use Grep/Glob — do not reason about callsites from memory. The output of this step is required input for every agent in Step 3. + +### 2.5a Semantic delta per changed symbol + +For every modified or added function (`def`, `cdef`, `cpdef`), method, class, `cdef class` attribute, module-level constant, enum member, or C-ABI declaration in a `.pxd`, write: + +- **Symbol:** fully-qualified name (e.g., `questdb.ingress.Buffer.column`, `_dataframe`, `c_err_to_py`, `line_sender_buffer_column_f64`) +- **Before:** signature, return type, **Cython exception convention** (`except -1` / `except *` / `except? -1` / `except +` / none / `noexcept`), what it raises and on which inputs, `nogil`-ness, whether it touches Python objects, allocation behavior (`malloc`/`calloc`/`realloc`), refcount effect (does it steal/borrow/own a reference?), C-ABI ownership semantics (who frees returned pointers), thread-safety +- **After:** same fields +- **Delta:** one line stating what semantically changed + +"Refactored", "cleaned up", "improved", "simplified" are not acceptable deltas. State the actual behavioral difference. If nothing semantically changed, write "no behavioral change" — but only after checking, not as a default. 
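One way to keep the 2.5a entries honest is to record them as structured data and reject vague deltas mechanically. The sketch below is illustrative only: the class names, the reduced field set, and the `problems` helper are assumptions, not something the skill defines.

```python
from dataclasses import dataclass

# Words 2.5a rejects because they describe intent, not behavior.
VAGUE_DELTAS = ("refactored", "cleaned up", "improved", "simplified")


@dataclass
class SymbolState:
    signature: str
    except_clause: str  # e.g. "except -1", "except *", "noexcept", "none"
    nogil: bool
    allocates: bool     # malloc/calloc/realloc inside the symbol
    ownership: str      # who frees returned pointers


@dataclass
class SemanticDelta:
    symbol: str
    before: SymbolState
    after: SymbolState
    delta: str

    def problems(self) -> list[str]:
        """Return reasons this entry would not satisfy 2.5a."""
        issues = []
        text = self.delta.strip().lower()
        if not text:
            issues.append("delta is empty")
        if any(word in text for word in VAGUE_DELTAS):
            issues.append("delta is vague; state the behavioral difference")
        if text == "no behavioral change" and self.before != self.after:
            issues.append("delta claims no change but recorded fields differ")
        return issues
```

An entry whose delta reads "Simplified error handling" is flagged, as is one that claims "no behavioral change" while its before and after `except_clause` values differ.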
+ +### 2.5b Callsite inventory + +For every changed symbol that is public (in `__all__` / `ingress.pyi`), `cdef`/`cpdef`, declared in a `.pxd`, or a C-ABI function, run Grep across the repository to find every callsite, override, or reference outside the diff. + +Produce a list grouped by file. Search at minimum: + +- **Cython implementation & includes:** `grep -rn 'symbol_name' src/questdb/*.pyx src/questdb/*.pxi` +- **Cython C-ABI / helper declarations:** `grep -rn 'symbol_name' src/questdb/*.pxd` +- **Type stubs:** `grep -rn 'symbol_name' src/questdb/ingress.pyi` +- **C-ABI header (source of truth):** `grep -rn 'symbol_name' c-questdb-client/include/questdb/ingress/` +- **Rust helper crate:** `grep -rn 'symbol_name' rpyutils/src/ rpyutils/include/` +- **Unit & mock-server tests:** `grep -rn 'symbol_name' test/test.py test/mock_server.py test/test_tools.py` +- **System / integration tests:** `grep -rn 'symbol_name' test/system_test.py` +- **DataFrame tests, fuzz tests, leak tests:** `grep -rn 'symbol_name' test/test_dataframe.py test/test_client_dataframe_fuzz.py test/test_dataframe_fuzz.py test/test_dataframe_leaks.py test/test_client_capsule_path.py` +- **Examples:** `grep -rn 'symbol_name' examples/` +- **Docs:** `grep -rn 'symbol_name' docs/` + +A changed public / `cdef` / `.pxd` symbol with zero recorded Grep calls in the trace is a skill violation. The model is not allowed to assert "this is only used here" without showing the search. + +### 2.5c Implicit contract list + +For each changed symbol, walk this checklist and write one line per item, stating before vs after: + +- **Cython exception convention:** does the function return a C type with the right `except` clause? A `cdef` function declared `noexcept` (explicitly, or implicitly because it has **no** `except` clause and the build uses Cython 0.29 or the Cython 3 `legacy_implicit_noexcept` directive) **silently swallows any Python exception raised inside it.** Under default Cython 3 semantics a missing clause propagates instead. Did the convention change, and do all callers still propagate errors correctly? 
+- **Raises which exceptions on which inputs** (`IngressError`, `ValueError`, `TypeError`, `IngressServerRejectionError`, `UnsupportedDataFrameShapeError`) and which callers catch vs propagate them +- **Native memory:** does the symbol allocate (`malloc`/`calloc`/`realloc`) and who frees it? Does it free on every path including the exception path? +- **Reference counting:** does it `Py_INCREF`/`Py_DECREF`, store a borrowed `PyObject*`, hold a weakref/capsule, or return a borrowed vs owned reference? +- **Buffer protocol:** does it call `PyObject_GetBuffer` (and the matching `PyBuffer_Release`)? Does it keep the exporter alive while the raw pointer is in use? +- **GIL:** does it run under `nogil`? Does it release the GIL around a blocking C call (flush/connect)? Does it reacquire to raise? +- **C-ABI ownership:** does it pass a `line_sender_buffer`/`line_sender_utf8`/`qdb_pystr_buf` pointer into Rust, and who owns it afterward? Is a returned `line_sender_error*` freed exactly once (`line_sender_error_free`)? +- **`qdb_pystr_buf` arena lifetime:** are UTF-8 pointers obtained from the arena still valid after a subsequent `clear`/append (which may reallocate and invalidate earlier pointers)? +- **Buffer/sender state on error:** does a failed call leave the `Buffer` half-written, or the `Sender` in an unusable state requiring reconstruction? +- **`.pxd` ↔ C header agreement:** parameter types, `const`-ness, struct layout, enum discriminant order, return type — does the Cython declaration still match `c-questdb-client/include/questdb/ingress/*.h`? +- **`.pyi` ↔ implementation agreement:** does the stub still match the real signature, defaults, and return type? +- **Wire format:** any change to the ILP bytes produced (protocol v1 / v2), timestamp units, or column encoding. + +### 2.5d Cross-context exposure list + +End this step with an explicit list of "places this change is visible from but the diff does not touch". 
This is the highest-priority input for the bug-hunting agents in Step 3. + +Group the callsites from 2.5b by execution context. Typical contexts in this codebase: + +- **C-ABI binding surface:** every C-ABI function declared in `src/questdb/line_sender.pxd` / `conf_str.pxd` / `arrow_c_data_interface.pxd` / `mpdecimal_compat.pxd` / `rpyutils.pxd` that the changed code calls (transitively) +- **Buffer build hot path:** `Buffer.column`, `Buffer.symbol`, `Buffer.row`, `Buffer.at*`, and their `cdef` helpers +- **DataFrame / Arrow ingestion path:** everything in `dataframe.pxi`, the pandas/numpy/pyarrow/polars code paths, Arrow C Data Interface (`ArrowArray`/`ArrowSchema`/`ArrowArrayStream`) consumption and release callbacks, PyCapsule handling +- **Egress / query path:** `egress.pxi`, `QueryResult` +- **Flush path:** `Sender.flush`, `Buffer` → transport, the `with nogil` blocking sections +- **Auto-flush logic:** any callsite that triggers flush implicitly (row count / byte threshold / interval) +- **Configuration parsing:** `Sender.from_conf` / `from_env`, the `conf_str` parser, keyword-argument handling +- **Authentication / TLS:** auth token / basic-auth / TLS-CA configuration paths +- **`nogil` / threading surface:** the `active_senders` registry (`rpyutils/src/active_senders.rs`), any code reachable from multiple threads +- **`qdb_pystr_buf` arena users:** every function that obtains UTF-8 pointers from the per-`Buffer` string arena +- **Python type stubs:** `ingress.pyi` +- **Tests:** `test/test.py`, `test/system_test.py`, `test/test_dataframe.py`, fuzz and leak tests +- **Examples & docs:** `examples/*.py`, `docs/` + +Every entry on this list must be reviewed in Step 3. 
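The grouping in 2.5d can be sketched as a pass over raw `grep -rn` output from 2.5b. This is a rough illustration under stated assumptions: the path-to-context rules paraphrase a few entries of the list above and are not exhaustive, file paths such as `src/questdb/dataframe.pxi` are inferred from the grep globs in 2.5b, and filtering at file granularity is a simplification (a changed file can still contain untouched callsites that need review).

```python
from collections import defaultdict

# Path prefix -> context label. Hypothetical mapping for illustration only.
CONTEXT_RULES = [
    ("src/questdb/ingress.pyi", "Python type stubs"),
    ("src/questdb/dataframe.pxi", "DataFrame / Arrow ingestion path"),
    ("src/questdb/egress.pxi", "Egress / query path"),
    ("test/", "Tests"),
    ("examples/", "Examples & docs"),
    ("docs/", "Examples & docs"),
]


def classify(path):
    """Map a file path to one execution context label."""
    if path.endswith(".pxd"):
        return "C-ABI binding surface"
    for prefix, label in CONTEXT_RULES:
        if path.startswith(prefix):
            return label
    return "Unclassified (review by hand)"


def group_callsites(grep_output, changed_files):
    """Group `path:line:text` grep hits by context, skipping changed files."""
    grouped = defaultdict(list)
    for line in grep_output.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # e.g. "Binary file ... matches"
        path, lineno, text = parts
        if path in changed_files:
            continue  # 2.5d lists places the diff does not touch
        grouped[classify(path)].append((path, int(lineno), text.strip()))
    return dict(grouped)
```

Anything that lands in the "Unclassified" bucket still has to be assigned to a context by reading the code; the sketch only removes the bookkeeping, not the judgment.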
+ +### 2.5e Build & binding profile facts + +**This sub-step runs at every level, including levels 0 and 1 where the rest of Step 2.5 is skipped.** A single Cython directive or a submodule bump can flip the safety story for the entire extension; agents must reason from the actual profile, not from defaults. + +Record, with file:line citations: + +- **Cython compiler directives** at the top of `ingress.pyx` and in `setup.py` (`language_level`, `binding`, and — if set — `boundscheck`, `wraparound`, `cdivision`, `initializedcheck`, `nonecheck`). If `boundscheck=False` / `wraparound=False`, **out-of-range or negative C-array/typed-memoryview indexing is undefined behavior, not an `IndexError`** — agents must treat indexing as a crash surface, not a guarded operation. +- **Cython exception-default fact:** the default depends on the Cython version and directives, so record both. Under default Cython 3 semantics a non-`extern` `cdef`/`cpdef` function returning a non-object type propagates Python exceptions even without an explicit `except` clause; it **swallows Python exceptions silently** only when declared `noexcept`, when the `legacy_implicit_noexcept` directive is enabled, or when the build uses Cython 0.29, where a missing clause meant no propagation. Functions declared in `cdef extern` blocks default to `noexcept`. Agents 1, 2, and 3 must check the actual `except` clause on every changed `cdef` and not assume exceptions propagate. +- **`c-questdb-client` submodule commit** (`git submodule status`) — if the diff moves it, the pinned commit's headers under `c-questdb-client/include/questdb/ingress/` are the *new* source of truth that every `.pxd` must match. Re-verify the `.pxd` ↔ `.h` agreement against the new commit. +- **`rpyutils` Rust crate:** if `rpyutils/src/**` or `rpyutils/Cargo.toml` changed, note its panic/profile behavior — a panic in `rpyutils` reached across the C ABI aborts the Python process. Its headers (`rpyutils/include/`, generated via `cbindgen.toml`) must match `rpyutils.pxd`. +- **Minimum numpy / Python versions** (`pyproject.toml`: `requires-python`, `numpy>=1.21.0`). Code that uses a newer numpy C-API or Python C-API symbol than the floor breaks the oldest supported build. State the floor. +- **`abort()` is imported** (`from libc.stdlib cimport ... 
abort`). Any reachable `abort()` call, or any Rust panic that crosses the C ABI, terminates the host interpreter with no traceback. Flag the path. + +A review without this section is incomplete. State the relevant facts (directives, exception default, submodule commit) in one line at the top of every Step 3 agent prompt so the agent reasons from the right premise. + +## Step 3: Parallel review + +Every agent receives: +1. The PR diff +2. The full change surface map from Step 2.5 (semantic deltas, callsite inventory, implicit contracts, cross-context exposure list, build & binding profile facts) + +### Anti-anchoring directive (applies to all agents) + +- **Bugs at callsites outside the diff outrank bugs inside the diff.** A confirmed bug in a file the PR did not touch but that calls a changed symbol is a P0 finding. +- **"Looks correct in isolation" is not a valid conclusion.** Before clearing a changed symbol, the agent must walk the callsite inventory from 2.5b and explicitly state, per callsite, whether the new behavior is still correct there. +- **The diff is the entry point, not the scope.** If the change surface map shows the symbol is reachable from N other files, the review covers N+1 files. +- **Project-wide settings affect untouched code.** A change to a Cython directive in `ingress.pyx` or `setup.py` (e.g. flipping `boundscheck` off), a `c-questdb-client` submodule bump, or a `.pxd` declaration change retroactively changes the safety/ABI story for **every** function that compiles under that directive or calls that binding — not just the diff. When directives, `setup.py`, `pyproject.toml`, or `.pxd`/submodule pointers appear in the diff, the review covers the affected surface of the whole extension, not just the touched lines. +- A single finding of the form "in `dataframe.pxi` the new behavior of `Buffer.column` leaks `b.validity` on the exception path" is worth more than five findings inside the diff. 
+ +### Agents + +Launch the following agents in parallel. + +**Agent 1 — Correctness & bugs:** `None`/NULL handling, edge cases, logic errors, off-by-one, operator precedence, error paths. Integer correctness across the Python↔C boundary: Python `int` → `int64_t`/`size_t` conversion and overflow, explicit C-type casts that truncate or wrap, signed/unsigned mismatches, negative-length math. NaN/inf float handling. Timestamp unit conversions (micros vs nanos). Correct ILP wire format (v1 / v2). Cross-reference every changed symbol against its callsite inventory and verify the new behavior is correct at each callsite. + +**Agent 2 — Cython memory, refcount & crash surface:** In a Cython extension, anything that corrupts memory or aborts the native side takes down the host Python interpreter with no traceback. Flag every reachable instance of: + +- **Native memory leaks / double-free / use-after-free:** every `malloc`/`calloc`/`realloc` must be `free`d on **all** paths — success, early `return`, and the exception/`except` path (prefer `try/finally`). A `realloc` whose return value is assigned back to the same pointer leaks the original on failure (it returns `NULL` without freeing). Freeing a pointer twice, or using it after `free`, corrupts the heap. +- **Reference-count errors:** every `Py_INCREF` needs a matching `Py_DECREF` on all paths; a missing `DECREF` leaks, an extra `DECREF` causes a later use-after-free crash. Borrowed references (`PyWeakref_GetObject`, dict/list borrows, `PyObject*` stored without incref) must not outlive their owner. Verify `PyCapsule` and weakref handling. +- **Buffer-protocol imbalance:** every `PyObject_GetBuffer` must have a matching `PyBuffer_Release` on all paths, and the raw pointer must not be used after the exporting object can be collected. +- **Indexing under `boundscheck=False`:** per 2.5e, C-array and typed-memoryview indexing is unchecked — an out-of-range or negative index is UB, not an exception. 
Verify bounds are established before every index on the hot path. +- **Silent exception swallowing:** a `cdef` function returning a C type without the correct `except` clause (or `noexcept`) drops Python exceptions on the floor, turning an error into wrong data. Verify the `except` convention against what the body raises. +- **Direct aborts:** any reachable `abort()` (it is imported), and any **Rust panic crossing the C ABI** (from `c-questdb-client` or `rpyutils`) — both terminate the interpreter. The only defense is that the native side returns an error code/`line_sender_error*`, never panics. +- **Uninitialized memory:** a struct field or `malloc`'d region read before it is written (use `calloc` or explicit init), especially partially-built `pyobj_built_t`-style structs on an error path that then get freed. + +State the relevant build facts (directives, exception default, submodule commit) from 2.5e in the agent's first sentence, and evaluate every finding under the actual settings, not the textbook defaults. + +**Agent 3 — C-ABI boundary safety:** Check every call into the `c-questdb-client` / `rpyutils` C ABI. Verify: +- **`.pxd` matches the C header.** For every changed or called C-ABI symbol, read the actual declaration in `c-questdb-client/include/questdb/ingress/*.h` (or `rpyutils/include/`) and confirm the `.pxd` declaration matches it exactly: parameter types, pointer/`const`-ness, return type, struct field order and types, enum discriminant order. A mismatch is silent memory corruption / ABI breakage. If the submodule pointer moved, verify against the **new** pinned commit. +- **NULL handling:** every pointer returned from a C function checked before dereference; every pointer argument that could be `NULL` handled. +- **Error object lifecycle:** every `line_sender_error*` obtained via an out-param is converted (`c_err_to_py`) and freed exactly once (`line_sender_error_free`) — never leaked, never double-freed, never freed then read. 
+- **Ownership transfer:** `line_sender_buffer`, `line_sender_utf8`, `qdb_pystr_buf`, `line_sender` handles — who allocates, who frees, and is the lifetime correct relative to the owning `cdef class` (`__cinit__`/`__dealloc__`)? +- **`qdb_pystr_buf` arena invalidation:** UTF-8 pointers handed to Rust must remain valid until the buffer write completes and must not be invalidated by an intervening arena `clear`/append. +- **String encoding:** Python `str` → UTF-8 (`line_sender_utf8`), correct length passed, no lone surrogates, embedded-NUL handling, `bytes` vs `str` distinction. + +**Agent 4 — GIL & concurrency:** Verify: +- **`nogil` correctness:** no `with nogil` block (or `cdef ... nogil` function) touches a Python object, calls the CPython C-API, raises a Python exception, or `INCREF`/`DECREF`s — doing so without the GIL is a crash/corruption. Errors discovered under `nogil` must be deferred and raised after reacquiring the GIL. +- **GIL release around blocking calls:** the flush/connect/network C calls should release the GIL (`with nogil`) so other threads run; verify the released region doesn't reference Python state. +- **Thread-safety:** `Sender`, `Buffer`, and the `active_senders` registry (`rpyutils/src/active_senders.rs`) — verify documented thread-safety matches the implementation, and that shared mutable state reachable from multiple threads is synchronized. Cross-reference every callsite from 2.5b for violations of the concurrency contract. +- **Free-threaded build:** if the change assumes the GIL serializes access, note whether it holds under a free-threaded (no-GIL) CPython build (the CI matrix includes `*t` free-threaded targets). + +**Agent 5 — Resource management & lifecycle:** Leaks on all code paths (especially errors). Check `__cinit__`/`__dealloc__` pairing on every `cdef class` (does `__dealloc__` free everything `__cinit__` and methods allocated, and is it safe when `__cinit__` failed partway?). 
Native handle lifecycle (`line_sender`, `line_sender_buffer`, `qdb_pystr_buf`). Socket/connection/TLS teardown on error (handled by Rust, but verify the Cython side calls close/free). **Arrow C Data Interface:** `ArrowArray`/`ArrowSchema`/`ArrowArrayStream` `release` callbacks invoked exactly once; PyCapsule consumption semantics correct; no double-release. Walk every callsite from 2.5b that constructs, owns, or transfers ownership of a native handle and verify cleanup on all paths (success, exception, early return). + +**Agent 6 — Performance & allocations:** Unnecessary work on hot paths — the per-row buffer build (`Buffer.column`/`symbol`/`row`) and the per-column DataFrame loop (`dataframe.pxi`). Flag: Python-level operations (attribute lookups, `dict` access, object boxing, `str` re-encoding) inside the inner per-row/per-cell loop that should be hoisted or done at C level; allocations per row/cell that should be amortized; excessive copying of data that could be zero-copy via the buffer protocol / Arrow; O(n²) patterns over rows or columns. Analyze scaling at realistic volume: millions of rows per flush, hundreds of columns. Setup-path costs (sender construction, config parsing, schema inspection done once per DataFrame) are acceptable; per-row/per-cell costs are not. + +**Agent 7 — Test review & coverage:** Coverage gaps, error-path tests, `None`/edge-case tests, boundary conditions, regression tests, test quality. 
Check: +- Unit / mock-server tests in `test/test.py` (uses `test/mock_server.py`) +- System / integration tests against a real QuestDB in `test/system_test.py` +- DataFrame tests in `test/test_dataframe.py`, fuzz tests in `test/test_client_dataframe_fuzz.py` / `test/test_dataframe_fuzz.py`, and **leak tests** in `test/test_dataframe_leaks.py` (new native-memory or refcount handling should have a leak test) +- Capsule / Arrow path tests in `test/test_client_capsule_path.py` +- Examples in `examples/` still run (and `examples.manifest.yaml` is consistent) + +Cross-reference 2.5d: every cross-context exposure should have a test that exercises the changed symbol from that context. Missing tests for cross-context callsites — especially a new native-memory path without a leak test, or a new C-ABI binding without a system test — is a high-priority finding. + +**Agent 8 — Code quality & API design:** Public API ergonomics and consistency. **`ingress.pyi` stub must match the implementation** (signatures, defaults, return types, new symbols added to `__all__`). Docstrings on public classes/methods. `CHANGELOG.rst` updated for user-visible changes. Backward compatibility of the Python API (renamed/removed kwargs, changed defaults, changed exception types) — breaking changes must be intentional and called out in the PR body. Naming consistent with the codebase. No dead code, no unused `cimport`/`import`. Docs under `docs/` updated for API changes. + +**Agent 9 — Cross-context caller impact:** Walk the callsite inventory from 2.5b. For every callsite, fetch the surrounding code (the calling function plus its callers up two levels) and answer: + +- Does this caller pass inputs the new behavior handles incorrectly? +- Does this caller depend on a contract from the implicit contract list (2.5c) that the change broke — e.g. relying on the old `except` convention, the old ownership of a buffer, the old `qdb_pystr_buf` lifetime, the old refcount behavior? 
+- Is this caller in a context (a `with nogil` block, the per-row hot loop, an auto-flush trigger, an Arrow release callback, a `__dealloc__`, an exception/error path) where the new behavior misbehaves even if the inputs are valid? +- For a changed `cdef`/`cpdef` exception convention: do all callers still detect and propagate the error? +- For a changed C-ABI declaration: does the `.pxd` still match the C header, and do all Cython callers pass the right types/ownership? +- For a changed buffer/sender state machine: do all callers respect the new state transitions (buffer cleared after error before reuse; flush only when flushable)? + +This agent's output is structured per callsite, not per failure mode. Each callsite gets a verdict: SAFE / BROKEN / NEEDS VERIFICATION. Every BROKEN entry is a P0 finding regardless of whether the file is in the diff. + +This agent is not optional even when the diff is small. Small diffs to widely-used symbols (`Buffer.column`, `Sender.flush`, the dataframe entry point, a C-ABI binding) have the largest blast radius. + +**Agent 10 — Fresh-context adversarial:** Dispatched separately from agents 1-9 to escape checklist anchoring. This agent operates under different rules from the rest: + +- It receives ONLY the PR diff and the names of the changed files. It does NOT receive the change surface map from Step 2.5, the implicit contract list, the cross-context exposure list, or any of the review checklists below. +- Its sole instruction: "find ways this code is wrong". No category list, no failure-mode taxonomy, no project-specific style guide. +- It is free to use Read, Grep, and Glob to explore the repository however it wants. +- Findings are not pre-classified by category. Each finding states: what's wrong, why it's wrong, and the code path that demonstrates it. + +The point of this agent is to surface bugs the structured agents cannot see because they are reasoning inside the same frame. 
A finding here that none of agents 1-9 produced is high signal — it means the structured review missed it. A finding here that overlaps with agents 1-9 is corroboration. + +Run this agent in parallel with agents 1-9. It is mandatory regardless of diff size. + +Combine all agent findings into a single deduplicated **draft** report. Do NOT present this draft to the user yet — it goes straight into verification. + +## Step 3b: Verify every finding against source code + +The parallel review agents work from the diff plus the change surface map and frequently produce false positives — especially around native memory ownership, refcounting, GIL boundaries, Cython exception conventions, and C-ABI lifecycle. Every finding MUST be verified before it is reported. + +For each finding in the draft report: + +1. **Read the actual source code** at the exact lines cited (in the `.pyx`/`.pxi`/`.pxd`/`.pyi`, never the generated `ingress.c`). Do not rely on the agent's description alone. +2. **Trace the full code path:** follow callers and `cdef` helpers. Remember Cython's `include` model — `dataframe.pxi` and `egress.pxi` are textually included into `ingress.pyx`, so symbols are shared across them. +3. **Check both sides of the C ABI:** if a finding involves Cython↔Rust interaction, read both the Cython call and the C header in `c-questdb-client/include/questdb/ingress/` (or `rpyutils/include/`). Verify ownership transfer, error propagation, and freeing on both sides. +4. **For native-memory-leak claims:** trace every `malloc`/`calloc`/`realloc` to its `free` on ALL paths (success, early return, `except`/exception unwind). Confirm the intervening code can actually raise before claiming the exception path leaks. +5. **For refcount claims:** count `Py_INCREF`/`Py_DECREF` on every path; confirm borrowed-vs-owned reasoning against the CPython C-API contract of each function used. +6. 
**For exception-swallowing claims:** check the actual `except` clause on the `cdef` and whether the body can raise. Under Cython 3 a `nogil` `cdef` defaults to `noexcept` — confirm whether that's the real declaration. +7. **For GIL claims:** verify the cited code is actually inside a `nogil` region and actually touches a Python object / C-API; a `cdef` function called from `nogil` may itself acquire the GIL. +8. **For C-ABI / `.pxd` mismatch claims:** read the exact declaration in the pinned header and compare field-by-field. A claimed mismatch that actually matches is a false positive. +9. **For numeric overflow/truncation claims:** check reachability at realistic scale — ILP buffers up to a few hundred MB, millions of rows per flush, columns in the tens to low hundreds. Drop overflows that require values beyond that scale. +10. **For performance claims:** confirm the cost is on the per-row/per-cell hot path and measurable relative to surrounding I/O. Downgrade negligible savings to a nit. Exception: a per-row or per-cell allocation / Python-object operation on the buffer-build path is always worth flagging. +11. **For cross-context findings (Agent 9):** re-read the callsite in full, including callers up two levels, and confirm the broken behavior is reachable from production or test paths users will exercise. + +**Classify each finding** as: +- **CONFIRMED in-diff** — the bug is real and inside the diff +- **CONFIRMED at out-of-diff callsite** — the bug is in an unchanged file because the changed symbol is used there in a way that's now broken (cite the file and the contract from 2.5c that was violated) +- **FALSE POSITIVE** — the code is actually correct (explain why) +- **CONFIRMED with nuance** — the issue exists but is less severe than stated (explain) + +**Move false positives to a separate "Downgraded" section** at the end of the report. For each, give a one-line explanation of why it was dismissed. 
This lets the PR author verify the reasoning and catch verification mistakes. + +Launch verification agents in parallel where findings are independent. Each verification agent should read surrounding source files, not just the diff. + +## Review checklists + +Review the diff for: + +### Correctness & bugs +- `None`/NULL handling at API boundaries +- Edge cases and error paths +- Logic errors, off-by-one, incorrect bounds, wrong operator precedence +- Integer overflow/truncation across the Python↔C boundary (`int` → `int64_t`/`size_t`, explicit casts, signed/unsigned) +- Float edge cases (NaN, inf), timestamp unit conversions (micros vs nanos) +- Correct ILP wire format (v1 / v2) +- **Reachability expansion:** for each changed symbol, list the new contexts it can appear in (DataFrame path, `nogil` section, auto-flush, Arrow callback, error path) and verify it works in each. + +### Cython memory & refcount safety +- Every `malloc`/`calloc`/`realloc` freed on success, early-return, and exception paths (prefer `try/finally`); no double-free, no use-after-free; `realloc`-failure path doesn't leak the original +- Every `Py_INCREF` matched by `Py_DECREF`; borrowed references not outliving their owner; weakref/capsule handling correct +- Every `PyObject_GetBuffer` matched by `PyBuffer_Release`; exporter kept alive while the pointer is used +- Correct Cython `except` convention on every `cdef`/`cpdef` returning a C type (no silent exception swallowing; `noexcept` is the Cython-3 default for `nogil` `cdef`) +- No reachable `abort()`, and no Rust panic crossing the C ABI (both kill the interpreter) +- Indexing safe under the active `boundscheck`/`wraparound` directives +- No uninitialized struct/heap memory read (use `calloc` or init before use, especially on partially-built error paths) + +### C-ABI boundary +- `.pxd` declarations match `c-questdb-client/include/questdb/ingress/*.h` (and `rpyutils/include/`) exactly — types, `const`, struct layout, enum order, return type —
against the **pinned** submodule commit +- All pointers returned from C checked for NULL before dereference +- Every `line_sender_error*` freed exactly once (`line_sender_error_free`), never double-freed or leaked +- Ownership semantics clear and correct (who allocates the handle, who frees it, lifetime vs the owning `cdef class`) +- `qdb_pystr_buf` arena pointers stay valid until consumed; not invalidated by an intervening `clear`/append +- String handling: `str` → UTF-8 with correct length, lone-surrogate rejection, embedded-NUL handling, `bytes`/`str` distinction +- ABI stability: a submodule bump that reorders a struct or renumbers an enum requires matching `.pxd` updates + +### GIL & concurrency +- No Python object access / C-API call / refcount op / raise inside a `with nogil` block or `cdef ... nogil` function +- GIL released around blocking network/flush C calls; released region references no Python state; errors deferred and raised after reacquiring +- `Sender`/`Buffer`/`active_senders` thread-safety matches documentation; shared mutable state synchronized +- Assumptions that the GIL serializes access re-checked for the free-threaded CPython build + +### Performance +- No per-row/per-cell Python-level operations (attribute/dict lookups, boxing, `str` re-encoding) in the buffer-build or DataFrame inner loops that belong at C level or hoisted to setup +- No per-row/per-cell allocations that should be amortized +- Zero-copy where possible (buffer protocol, Arrow) instead of copying +- No O(n²) over rows or columns at realistic scale (millions of rows, hundreds of columns) + +### Resource management +- `__cinit__`/`__dealloc__` pair frees everything allocated, and `__dealloc__` is safe after a partially-failed `__cinit__` +- Native handles (`line_sender`, `line_sender_buffer`, `qdb_pystr_buf`) released on all paths +- Socket/connection/TLS cleanup on error (Cython side invokes the Rust close/free) +- Arrow `release` callbacks invoked exactly once; PyCapsule 
consumed correctly; no double-release +- No leak through the C-ABI boundary (ownership documented and consistent) + +### Code quality +- `ingress.pyi` stub matches the implementation (signatures, defaults, return types, `__all__`) +- Public API consistent and ergonomic; backward-compatible (or breaking changes called out in the PR body) +- `CHANGELOG.rst` updated for user-visible changes; `docs/` updated for API changes +- Docstrings on public classes/methods +- Naming consistent with the codebase; no dead code or unused `import`/`cimport` + +### Test review +- **Coverage gaps:** every new/changed code path has a corresponding test; flag missing ones explicitly as "missing test for X" +- **Cross-context coverage:** every entry in the cross-context exposure list (2.5d) has a test exercising the changed symbol from that context +- **Leak coverage:** new native-memory or refcount-handling code has a test in `test/test_dataframe_leaks.py` (or equivalent) +- **Error-path coverage:** failure cases, partial writes, connection drops, TLS/auth failures, server rejections, and edge conditions tested — not just the happy path +- **Edge-case tests:** `None`, empty buffers, zero-length strings, max-length symbols, boundary integers, NaN/inf, non-UTF-8 strings +- **C-ABI / binding changes** covered by a system test in `test/system_test.py` +- **DataFrame / Arrow changes** covered in `test/test_dataframe.py` and the fuzz/capsule tests +- **Test quality:** tests assert the right thing; watch for trivially-passing tests +- **Regression tests:** a bug fix has a test that reproduces the original bug and fails without the fix + +### Unresolved TODOs and FIXMEs +- Scan the diff for `TODO`, `FIXME`, `HACK`, `XXX`, `WORKAROUND`. For each: + - Pre-existing (just moved/reformatted) or newly introduced in this PR? + - If new: unfinished work that should block merge, or an acceptable known limitation? Flag deferred bugs or incomplete implementations. 
+ - If it references a ticket/issue, verify the reference exists. + +### Commit messages +- Plain English titles, under 50 chars +- Active voice, naming the acting subject + +## Step 4: Output + +Present ONLY verified findings (false positives are excluded from Critical/Moderate/Minor). Structure as: + +### Critical +Issues that must be fixed before merge. Each must include: +- Exact file path and line numbers (including out-of-diff files) +- Whether the finding is **in-diff** or **out-of-diff** +- Code path trace showing why the bug is real +- For out-of-diff findings: the contract from 2.5c that was violated and the callsite that triggers it +- Suggested fix + +### Moderate +Issues worth addressing but not blocking. + +### Minor +Style nits and suggestions. + +### Downgraded (false positives) +Findings from the initial review that were dismissed after source code verification. For each, state: +- The original claim (one line) +- Why it was dismissed (one line, citing the specific code that disproves it) + +### Summary +- One-line verdict: approve, request changes, or needs discussion +- Highlight any regressions or tradeoffs +- State how many draft findings were verified vs dropped as false positives (e.g., "8 findings verified, 4 false positives removed") +- State the in-diff vs out-of-diff split (e.g., "5 findings in-diff, 3 findings out-of-diff"). If the diff is non-trivial and out-of-diff is zero, the cross-context pass likely underran — re-invoke Agent 9 with a wider grep before finalizing. diff --git a/CHANGELOG.rst b/CHANGELOG.rst index c3deae50..68351f40 100644 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -5,6 +5,123 @@ Changelog ========= +Unreleased +---------- + +Features +~~~~~~~~ + +QWP Ingestion Protocol +********************** + +Adds support for the QuestDB Wire Protocol (QWP) alongside the existing +ILP transports. + +- **QWP/UDP** (``qwpudp::``): fire-and-forget datagram ingestion, + defaulting to port 9007. 
New configuration keys ``max_datagram_size`` + and ``multicast_ttl``; ``protocol_version`` does not apply. +- **QWP/WebSocket** (``qwpws::`` / ``qwpwss::``): acknowledged streaming + ingestion with frame-sequence-number (FSN) tracking. New ``Sender`` + methods ``flush_and_get_fsn``, ``flush_and_keep_and_get_fsn``, + ``published_fsn``, ``acked_fsn``, ``await_acked_fsn``, ``drive_once``, + ``poll_qwp_ws_error``, ``qwp_ws_errors_dropped`` and ``close_drain``. + Server diagnostics are reported through a ``qwp_ws_error_handler`` + callback or polled as :class:`QwpWsError` values; terminal server + rejections raise :class:`IngressServerRejectionError`. + +Additional configuration keys ``tls_roots_password``, +``retry_max_backoff_millis`` and ``qwp_ws_progress`` are also accepted. +Every new key is equally available as a ``Sender`` / ``Sender.from_conf`` / +``Sender.from_env`` keyword argument (``max_datagram_size``, +``multicast_ttl``, ``tls_roots_password``, ``retry_max_backoff``, +``qwp_ws_progress`` and ``qwp_ws_error_handler``). + +Buffer Factories +**************** + +``Buffer.ilp()`` and ``Buffer.qwp()`` construct protocol-specific +buffers. Direct ``Buffer(...)`` construction is deprecated in favour of +these factories and ``Sender.new_buffer()``. + +Query Egress +************ + +Adds :class:`Client` with :meth:`Client.query`, returning a +:class:`QueryResult` that streams rows as Arrow record batches over the +QWP/WebSocket read endpoint. Results can be consumed via ``to_arrow``, +``to_pandas``, ``to_polars``, ``iter_arrow``, ``iter_pandas``, +``iter_polars`` or the Arrow C stream PyCapsule protocol +(``__arrow_c_stream__``). ``to_polars`` / ``iter_polars`` use pyarrow to +buffer failover-safe batches; ``__arrow_c_stream__`` (consumed as +``polars.from_arrow(result)``) is the pyarrow-free polars path. 
SYMBOL +columns are dictionary-encoded on the wire and map to a pandas +``Categorical`` (``to_pandas`` / ``iter_pandas``) or a polars +``Categorical`` (``to_polars`` / ``iter_polars``), the latter sharing one +persistent ``Categories`` identity across streamed batches so +``polars.concat`` stitches them without a categories-mismatch error. + +``to_pandas`` / ``iter_pandas`` default to a native (no-pyarrow) build +straight from the QWP column buffers: a nullable integer column becomes a +pandas nullable ``Int*`` when it contains nulls and plain numpy otherwise, +``double`` stays numpy with ``NaN``, ``TIMESTAMP`` → ``datetime64``, and +the QuestDB column kinds are recorded in ``df.attrs['questdb']`` for a type +round-trip through :meth:`Client.dataframe`. Pass +``dtype_backend="pyarrow"`` / ``"numpy_nullable"`` (or ``types_mapper=``) +to select the pyarrow-backed conversion instead. + +:class:`Client` is a context manager and exposes :meth:`Client.close` and +:meth:`Client.reap_idle` for pooled-connection lifecycle management. + +Columnar DataFrame Ingestion +**************************** + +Adds :meth:`Client.dataframe`, ingesting pandas / polars / pyarrow and +any Arrow C Data Interface object over QWP/WebSocket. A +``schema_overrides`` keyword reclassifies columns as ``symbol``, +``ipv4``, ``char`` or ``geohash`` (e.g. ``{'addr': 'ipv4', 'loc': +('geohash', 20)}``). + +The designated-timestamp argument ``at`` is the timestamp column itself, +given by name (``str``) or position (``int``); unlike +:meth:`Sender.dataframe` / :meth:`Buffer.dataframe` it does not accept a +scalar ``datetime`` / ``TimestampNanos`` / ``ServerTimestamp``. A frame +produced by :meth:`QueryResult.to_pandas` round-trips back to the same +QuestDB column types automatically: the kinds recorded in +``df.attrs['questdb']`` and pandas nullable extension dtypes are honoured. 
+ +Errors +****** + +Adds :class:`UnsupportedDataFrameShapeError` (raised when a DataFrame +cannot be expressed on the QWP columnar path) and the +:class:`IngressErrorCode` members ``ServerRejection``, +``ArrowUnsupportedColumnKind``, ``ArrowIngest`` and ``Cancelled``. +:class:`IngressError` gains a ``qwp_ws_error`` property exposing the +structured :class:`QwpWsError` view on a server-side QWP/WebSocket +rejection. + +Build & dependencies +~~~~~~~~~~~~~~~~~~~~~~ + +- The minimum supported Python is raised from 3.8 to **3.10**; Python 3.8 + and 3.9 are no longer supported. +- ``numpy>=1.21.0`` is now a hard runtime dependency (previously it was + pulled in only via the ``dataframe`` extra). +- **pyarrow is now optional.** It is imported lazily, only when actually + needed (``pd.ArrowDtype`` columns, pyarrow sources, ``schema_overrides``, + and the ``to_arrow`` / ``iter_arrow`` / ``dtype_backend`` helpers). The + ``to_polars`` / ``__arrow_c_stream__`` egress paths and the default + ``to_pandas`` / ``iter_pandas`` work without pyarrow. +- The ``dataframe`` extra now pins ``pandas>=1.3.5`` and + ``pyarrow>=10.0.1``. + +Deprecations +~~~~~~~~~~~~ + +- Direct ``Buffer(...)`` construction is deprecated and emits a + ``DeprecationWarning``. Use ``Buffer.ilp()``, ``Buffer.qwp()`` or + ``Sender.new_buffer()`` instead. 
+ 4.1.0 (2025-11-28) ------------------ diff --git a/EGRESS_FAILOVER_REVIEW.md b/EGRESS_FAILOVER_REVIEW.md new file mode 100644 index 00000000..0aade9c7 --- /dev/null +++ b/EGRESS_FAILOVER_REVIEW.md @@ -0,0 +1,224 @@ +# Egress Failover Review — Rust/Python client vs Java reference & server contract + +**Date:** 2026-06-18 +**Branch:** `jh_conn_pool_refactor` (submodule `c-questdb-client` @ `6fd1989`) +**Scope:** the QWP/WebSocket **egress (query/read)** failover path — multi-endpoint +walk, host-health tracking, retry budget, role-mismatch taxonomy, connect/handshake/TLS, +and test coverage — compared against: + +- **Java reference client:** `/home/jara/devel/oss/java-questdb-client` (`QwpQueryClient`, `QwpHostHealthTracker`) +- **Server wire contract:** `/home/jara/devel/oss/questdb-arrays` (`core/src/main/.../cutlass/qwp/server`) +- **Enterprise role/zone/auth/TLS:** `/home/jara/devel/oss/questdb-enterprise` (`questdb-ent/.../cutlass/qwp`) + +> Note on the standalone vs embedded Java client: `QwpQueryClient.java` and +> `QwpHostHealthTracker.java` in `/home/jara/devel/oss/java-questdb-client` are +> **byte-identical** to the copies embedded in `questdb-arrays/java-questdb-client`, +> so the implementation comparison is valid; only the *test sets* differ. + +--- + +## 1. Verdict + +The egress failover is a **faithful, high-quality port of the Java reference, with no +critical correctness gap against the server wire contract.** In several areas it is +*stronger* than Java (mid-query replay correctness, a `FailoverWouldDuplicate` +anti-duplication guard, bounded TLS-handshake reads, exact dial-budget assertions). + +The real issues are: + +1. **Small behavioral divergences** on the egress path worth a deliberate decision. +2. **A meaningful e2e test-coverage gap** — the Rust/py failover walk is only + unit-tested; the live multi-endpoint role/zone e2e harness drives the *Java* client. + +--- + +## 2. 
Architecture grounding (read this first) + +There are **two role-signaling mechanisms on two different endpoints.** Conflating +them is the main source of confusion in this area. + +| | Ingress `/write/v4` (line sender) | **Egress `/read/v1` (query reader — focus)** | +|---|---|---| +| Role rejection | HTTP **`421 Misdirected Request` + `X-QuestDB-Role`** (pre-upgrade) | **Always 101**, then an unsolicited binary **`SERVER_INFO` (`0x18`)** frame carrying `role/epoch/capabilities/zone`; client applies its `target=` filter and skips on mismatch | +| Java code | `QwpIngressUpgradeProcessor` / `QwpUpgradeFailures` | `QwpQueryClient.connect()` → `matchesTarget` | +| Rust code | `ingress/sender/qwp_ws*.rs` | `egress/{reader,transport,server_event}.rs` | + +The egress reader handles **both** surfaces (the 421 path defensively, for proxies / +mixed deploys) and matches Java exactly. + +- **Roles:** `STANDALONE=0x00, PRIMARY=0x01, REPLICA=0x02, PRIMARY_CATCHUP=0x03`. +- **Transient vs topological:** only `PRIMARY_CATCHUP` is *transient* (promotion in + flight); every other role, including unrecognized tokens, is *topological*. + **Confirmed identical** in Java (`QwpIngressRoleRejectedException.isTransient`, + lines 84-86) and Rust (`egress/error.rs:202-206` `UpgradeReject::is_transient`). +- **No mid-stream resume contract.** On a dropped read the client must re-issue the + whole query (fresh `request_id`, replay from `batch_seq=0`). The server provides no + ACK/offset/resume token; its only "rollback" is internal connection-scoped + symbol-dict cleanup (`QwpEgressResumeRollbackTest`). + +--- + +## 3. Parity confirmed (what matches) + +- **Host-health tracker** (`egress/tracker.rs` vs `QwpHostHealthTracker.java`): near 1:1 + port — identical `(state, zone_tier)` priority lattice + (`HEALTHY < UNKNOWN < TRANSIENT_REJECT < TRANSPORT_ERROR < TOPOLOGY_REJECT`, + `SAME < UNKNOWN < OTHER`), sticky-healthy semantics, round-based recovery (no timed + expiry / half-open). 
Backoff constants match to the millisecond: + **8 attempts / 50 ms→1 s full-jitter / 30 s deadline / 15 s auth-timeout / + 5 s server-info-timeout.** +- **Retry budget** (`reader.rs`/`transport.rs` vs `QwpQueryClient`): the + `545f8a6` *"align failover budget with execute attempts"* fix is **correct** — + per-Execute attempt+time budget, initial attempt counted + (`reconnect_rounds = max_attempts - 1`), deadline checked before sleeping, budget + shared across successive mid-query failovers (not reset), and the connect-walk (role + election) kept *out* of the per-query budget. No off-by-one. +- **`RoleMismatch` error-code plumbing is ABI-stable and consistent end-to-end:** + `ErrorCode::RoleMismatch → line_sender_error_role_mismatch = 18` (appended) across + `error.rs`, FFI `lib.rs:294/343`, the C header (`:156`), the ABI tripwire test, and + `line_sender.pxd`/`ingress.pyx`. The reader's pre-existing + `line_reader_error_role_mismatch = 8` (separate enum) also folds into + py `IngressErrorCode.RoleMismatch`. The Python-only sentinel relocation to the + `0x10000` band (`BadDataFrame`, `Cancelled`, `FailoverWouldDuplicate`) genuinely + prevents aliasing of the appended FFI code, and the enum-guard test + (`test_python_only_error_codes_do_not_overlap_ffi_codes`) really catches collisions. +- **Connect/handshake/TLS classification: zero mismatches.** 401/403 → terminal + cluster-wide; refused / TLS / 421 / 426 / 5xx / 404 / malformed / version-mismatch / + timeout → retry-next. Rust is **safer** on one axis: its `auth_timeout` bounds the + TLS-handshake read (lazy rustls during the upgrade), whereas Java does TLS eagerly + with only OS timeouts — so a TLS-layer blackhole is bounded in Rust, unbounded in + Java. Rust connect cleanup is RAII; no FD/native leak found under repeated failover. 
+- **Enterprise role/zone model is fully expressible** in Rust/py: + `target=any|primary|replica`, `zone=`, `CAP_ZONE=0x0000_0001`, identical wire + bytes, case-insensitive/trimmed zone comparison, `target=primary` collapses zone + tiers to `Same`. Python reaches every knob via the pass-through conf string + (`line_reader_from_conf`). +- **Rust/py is *ahead* of Java** on coverage in: read-side mid-query replay + + schema re-read, the `FailoverWouldDuplicate` streaming guard, distinct + deadline-vs-attempts exhaustion messages, on-wire auth-header byte pinning, and + progress-callback lifecycle assertions. + +--- + +## 4. Findings (prioritized) + +Legend — **Path:** which connect path the finding is on. **Sev:** severity. + +| # | Path | Sev | Finding | Evidence | Recommendation | +|---|------|-----|---------|----------|----------------| +| 1 | Egress | Low | **Terminal-code set is wider in Rust:** `ConfigError` / `UnsupportedServer` / `AuthError` abort the walk; Java aborts only on auth and retries the rest across all hosts. Unclear whether `UnsupportedServer` ever fires on the *connect* path (content-encoding rejection maps to failover-eligible `HandshakeError`). | `reader.rs:305-309,514-518` vs `QwpQueryClient.java:866-868`; `transport.rs:540-545` | Confirm which connect-time condition yields terminal `UnsupportedServer`; align or document. Verdict is identical against a uniformly-bad cluster; only latency/log pressure differs. | +| 2 | Egress | Low | **401 and 403 collapse into one `AuthError`.** Behavior (terminate) is correct, but enterprise tests assert a 401 (bad credential) vs 403 (no grant / disabled) distinction, plus an in-band SQL `SECURITY_ERROR` — diagnostic granularity is lost. | `transport.rs:642`; enterprise `QwpEgressAuthTest`, `QwpWebSocketTlsAclTest:193` | Optional: keep the HTTP status on `AuthError` for diagnostics. 
| +| 3 | Egress | Low / by-design | **Replay-from-zero with `FailoverWouldDuplicate` guard.** Rust refuses post-data-delivery replay unless an `on_failover_reset` callback is registered; Java replays unconditionally. Rust is *safer*, and this **matches the server contract** (no resume token). Portability gotcha when comparing the two clients. | `reader.rs:434,900-968`; server: no resume (`QwpEgressResumeRollbackTest`) | Keep. Document the "restart-from-zero" expectation for reader callbacks. | +| 4 | Egress | Cosmetic | A query-path role mismatch surfaces under **`IngressErrorCode.RoleMismatch`** — shared enum name, right category, but the `Ingress` prefix reads oddly in a query traceback. | `egress.pxi:29-30` | Consider an alias / rename if the public surface allows. | +| 5 | Ingress | Low | Sender has **no post-handshake `SERVER_INFO` `target=` re-check** (detects role only via 421). Benign **given the contract** — the ingress server 421s a role mismatch *before* upgrading — but it's an unstated asymmetry vs the reader/Java. | `qwp_ws.rs` (absent); reader `reader.rs:361-408` | Confirm the "ingress always 421s first" assumption holds for all deployments/proxies; otherwise add the re-check. | + +--- + +## 5. Test coverage — gaps (ranked) + +1. **No live multi-endpoint e2e for the Rust/py egress walk.** Enterprise + `QwpEgressServerInfoRoleTest` / `QwpEgressServerInfoZoneTest` and the + `QwpEgressSidecarMain` harness stand up real primary+replica nodes — but the sidecar + drives the **Java** `QwpQueryClient`. The Rust/py failover walk is only *unit*-tested + (tracker, decoders, config). **Highest-value gap.** +2. **No live mid-query failover-to-replica replay test for Rust/py.** The mechanism + exists (`on_failover_reset` trampoline, replay, schema re-read) but no integration + test forces a real mid-stream disconnect against a second live endpoint and asserts a + complete, strictly-ascending result. 
Java has + `QwpEgressServerInfoRoleTest::testFailoverToReplicaReplaysAfterMidStreamDisconnect` + via a server debug hook. +3. **403 / 404 / 426 classification not pinned in Rust tests** (only 401/421/version). + The *code* is correct (403→`AuthError` terminal; 404/426→failover-eligible), so these + are missing *tests*, not bugs — but the 401-vs-404 "is this terminal?" boundary + deserves a guard. Java pins all of these in `QwpQueryClientWalkTrackerTest`. +4. **Connect-failure resource-leak assertion absent.** Java has `QueryClientPoolLeakTest` + (native scratch on connect failure). Rust connect is RAII (low risk), but the **py + eager-pool `from_conf` connect-walk-failure** path has no FD / native-memory leak + assertion. +5. **No TLS-failover test anywhere** (Rust suite is `ws://` only) and **no + concurrent-query-during-failover test anywhere** — the + `reader_migrates_to_worker_thread_with_concurrent_stats_polling` test runs queries + *sequentially* while polling atomic stats; it validates `Send`/`Sync`, not concurrent + failover. +6. **Python fakes are HTTP-status stubs** (`_FakeStatusServer`) that never complete the + WS upgrade or emit a `SERVER_INFO` frame. So the Python role-negotiation tests exercise + the 421/401 *upgrade-reject* path but **not the SERVER_INFO-frame role filter** — the + *primary* egress mechanism. The reader-side `line_reader_error_role_mismatch=8 → + py RoleMismatch` mapping is likewise only covered via the sender path. + +--- + +## 6. Test-quality concerns + +- **Weak budget assertions in Python/system tests.** `test_*_exhausts_budget` checks only + that the error *code* is in `{SocketError, ProtocolError, FailoverWouldDuplicate}`, not + the dial count. A double-walk regression would pass. The exact-count contract + (13 dials = 1 + 3×4) lives only in Rust's `attempts_exhausted_surfaces_error`. 
+- **Timing/sleep flakiness:** + - Python streaming `test_iter_*_surfaces_failover_would_duplicate` relies on a 100M-row + query still producing after the first batch when the server is bounced (comments admit + it "can finish before the bounce"). Most flake-prone Python test. + - Enterprise `test_kill9_primary_failover_no_data_loss` uses `time.sleep(0.5)` before + SIGKILL; 60 s/180 s helper timeouts inflate CI cost. + - Rust `backoff_bounded_by_jitter_ceiling` (<640 ms) and + `failover_callback_runs_before_replayed_read` (100 ms park) are wall-clock asserts + sensitive to loaded CI. +- **Fakes don't model the wire.** Both the Java `FakeStatusServer` and Python + `_FakeStatusServer` answer a fixed HTTP status; they cannot drive the SERVER_INFO frame + path. Exotic wire faults (stalled upgrade, malformed SERVER_INFO, version mismatch) + remain Rust-mock-only. +- **Intentional multi-outcome tolerance** in `add_credit_failover_post_conditions_are_consistent` + (accepts `resets ∈ {0,1}`) is documented and unavoidable, but means that branch is only + deterministically pinned by the `would_silently_duplicate_truth_table` unit test. + +--- + +## 7. Recommended next actions + +1. **Add a live two-endpoint egress e2e** for Rust/py: differing roles + (skip-replica → bind-primary via SERVER_INFO) and a forced mid-stream disconnect → + replay (covers test gaps #1 + #2). Biggest coverage win. +2. **Add cheap classification tests:** 403-terminal, 404/426-walk-past, and a py-side + leak assertion on connect-walk failure. +3. **Resolve the open questions** below before they bite in Enterprise multi-node. + +--- + +## 8. Open questions + +- Does the connect path ever produce a terminal **`UnsupportedServer`**, or only + mid-stream from the zstd decoder? (Finding #1) +- **`epoch` is parsed but unused.** The contract notes clients "tracking a specific + primary use epoch to refuse a stale reconnection." 
Fine for OSS (epoch always 0) — is + there an intended Enterprise stale-primary refusal not yet wired? +- Does the reader have an **async/retrying initial-connect** mode (Java + `InitialConnectAsyncTest`)? If purely synchronous-connect, those Java tests are + correctly N/A. +- Does the Rust egress upgrade actually **send `X-QWP-Max-Version` / `X-QWP-Client-Id`**? + (Server clamps if absent — not a failover risk, but worth a one-line check in request + construction.) +- Is the **post-connect mutation guard** (`QwpQueryClientPostConnectGuardTest`) + inapplicable because `ReaderConfig` is immutable after `from_conf`? Confirm no + post-connect-mutable knob is exposed via FFI. + +--- + +## Appendix — methodology + +Review fanned out across seven parallel agents, each comparing the Rust implementation +(`c-questdb-client/questdb-rs/src/egress/`) against the Java reference and/or the server +contract: + +| Agent | Area | Primary sources | +|-------|------|-----------------| +| A1 | Host-health tracking & endpoint selection | `egress/tracker.rs` vs `QwpHostHealthTracker.java` (+ test) | +| A2 | Retry loop & retry budget | `egress/{transport,reader,config}.rs` vs `QwpQueryClient.java`; submodule `545f8a6` | +| A3 | Error taxonomy & role-mismatch end-to-end | `error.rs` / FFI / `egress.pxi` vs `QwpRoleMismatchException` & friends | +| A4 | Connect / handshake / auth / TLS | `egress/{ws/client,auth,tls,transport}.rs` vs `WebSocketClient.java` | +| A5 | Test coverage & scenario matrix | `tests/egress_failover.rs`, `test/system_test.py`, `failover_clients/` vs Java failover tests | +| A6 | Server wire contract & e2e | `questdb-arrays/core/.../cutlass/qwp/server` + server-side e2e tests | +| A7 | Enterprise role/zone/auth/TLS | `questdb-enterprise/questdb-ent/.../cutlass/qwp` + enterprise e2e tests | + +Cross-checks performed directly (not via agents): role-transience parity +(`QwpIngressRoleRejectedException.isTransient` vs `UpgradeReject::is_transient`), 
+standalone-vs-embedded Java client byte-identity, and the failover commit inventory. diff --git a/c-questdb-client b/c-questdb-client index 34905ab2..a4e6ba97 160000 --- a/c-questdb-client +++ b/c-questdb-client @@ -1 +1 @@ -Subproject commit 34905ab227bb95acb27c0b5c38ae31a9a084f2a3 +Subproject commit a4e6ba97d875ccaaa3d3e761f8912e509c786c48 diff --git a/ci/cibuildwheel.yaml b/ci/cibuildwheel.yaml index c0d31767..4c3007d8 100644 --- a/ci/cibuildwheel.yaml +++ b/ci/cibuildwheel.yaml @@ -107,7 +107,7 @@ stages: cmd /c "call `"$vsPath`" && set > env_vars.txt" Get-Content env_vars.txt | ForEach-Object { - if ($_ -match "^([^=]+?)=(.*)$" -and $matches[1] -notmatch '^(SYSTEM|AGENT|BUILD|RELEASE|VSTS|TASK|USE_|FAIL_|MSDEPLOY|AZP_75787|AZP_AGENT|AZP_ENABLE|AZURE_HTTP|COPYFILESOVERSSHV0|ENABLE_ISSUE_SOURCE_VALIDATION|MODIFY_NUMBER_OF_RETRIES_IN_ROBOCOPY|MSBUILDHELPERS_ENABLE_TELEMETRY|RETIRE_AZURERM_POWERSHELL_MODULE|ROSETTA2_WARNING|AZP_PS_ENABLE)') { + if ($_ -match "^([^=]+?)=(.*)$" -and $matches[1] -notmatch '^(SYSTEM|AGENT|BUILD|RELEASE|VSTS|TASK|USE_|FAIL_|MSDEPLOY|AZP_|AZURE_HTTP|COPYFILESOVERSSHV0|ENABLE_ISSUE_SOURCE_VALIDATION|MODIFY_NUMBER_OF_RETRIES_IN_ROBOCOPY|MSBUILDHELPERS_ENABLE_TELEMETRY|RETIRE_AZURERM_POWERSHELL_MODULE|ROSETTA2_WARNING)') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], "Process") Write-Host "##vso[task.setvariable variable=$($matches[1])]$($matches[2])" } @@ -137,7 +137,7 @@ stages: cmd /c "call `"$vsPath`" && set > env_vars.txt" Get-Content env_vars.txt | ForEach-Object { - if ($_ -match "^([^=]+?)=(.*)$" -and $matches[1] -notmatch '^(SYSTEM|AGENT|BUILD|RELEASE|VSTS|TASK|USE_|FAIL_|MSDEPLOY|AZP_75787|AZP_AGENT|AZP_ENABLE|AZURE_HTTP|COPYFILESOVERSSHV0|ENABLE_ISSUE_SOURCE_VALIDATION|MODIFY_NUMBER_OF_RETRIES_IN_ROBOCOPY|MSBUILDHELPERS_ENABLE_TELEMETRY|RETIRE_AZURERM_POWERSHELL_MODULE|ROSETTA2_WARNING|AZP_PS_ENABLE)') { + if ($_ -match "^([^=]+?)=(.*)$" -and $matches[1] -notmatch 
'^(SYSTEM|AGENT|BUILD|RELEASE|VSTS|TASK|USE_|FAIL_|MSDEPLOY|AZP_|AZURE_HTTP|COPYFILESOVERSSHV0|ENABLE_ISSUE_SOURCE_VALIDATION|MODIFY_NUMBER_OF_RETRIES_IN_ROBOCOPY|MSBUILDHELPERS_ENABLE_TELEMETRY|RETIRE_AZURERM_POWERSHELL_MODULE|ROSETTA2_WARNING)') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], "Process") Write-Host "##vso[task.setvariable variable=$($matches[1])]$($matches[2])" } diff --git a/ci/pip_install_deps.py b/ci/pip_install_deps.py index d70b9761..847a02e1 100644 --- a/ci/pip_install_deps.py +++ b/ci/pip_install_deps.py @@ -101,6 +101,8 @@ def main(args): try_pip_install('fastparquet>=2023.10.1') try_pip_install('pyarrow') + try_pip_install('polars') + try_pip_install('psutil') on_linux_is_glibc = ( (not platform.system() == 'Linux') or diff --git a/ci/run_tests_pipeline.yaml b/ci/run_tests_pipeline.yaml index 80099fb9..9eae9fa4 100644 --- a/ci/run_tests_pipeline.yaml +++ b/ci/run_tests_pipeline.yaml @@ -59,25 +59,41 @@ stages: condition: ne(variables.pandasVersion, '') - script: python3 proj.py build displayName: "Build" - - script: | - git clone --depth 1 https://github.com/questdb/questdb.git - displayName: git clone questdb master + - template: templates/clone_questdb.yaml + parameters: + condition: eq(variables.vsQuestDbMaster, true) + - bash: | + set -euo pipefail + JDK_HOME="${JAVA_HOME_25_X64:-}" + if [ -z "$JDK_HOME" ] || [ ! -x "$JDK_HOME/bin/javac" ]; then + JDK_HOME="/opt/jdk25" + sudo mkdir -p "$JDK_HOME" + curl -fsSL "https://api.adoptium.net/v3/binary/latest/25/ga/linux/x64/jdk/hotspot/normal/eclipse" | + sudo tar -xz -C "$JDK_HOME" --strip-components=1 + fi + # Azure parses ##vso logging commands on both stdout and stderr. + # Keep xtrace off here so bash never emits a quoted stderr copy. 
+ set +x + echo "##vso[task.setvariable variable=JAVA_HOME]$JDK_HOME" + echo "##vso[task.prependpath]$JDK_HOME/bin" + displayName: "Resolve JDK 25" condition: eq(variables.vsQuestDbMaster, true) - task: Maven@3 displayName: "Compile QuestDB master" inputs: mavenPOMFile: "questdb/pom.xml" - jdkVersionOption: "1.17" - options: "-DskipTests -Pbuild-web-console" + javaHomeOption: "Path" + jdkDirectory: "$(JAVA_HOME)" + options: "-DskipTests -Pbuild-web-console$(CLIENT_PROFILE)" condition: eq(variables.vsQuestDbMaster, true) - script: python3 proj.py test 1 displayName: "Test vs released" env: - JAVA_HOME: $(JAVA_HOME_17_X64) + JAVA_HOME: $(JAVA_HOME_25_X64) - script: python3 proj.py test 1 displayName: "Test vs master" env: - JAVA_HOME: $(JAVA_HOME_17_X64) + JAVA_HOME: $(JAVA_HOME) QDB_REPO_PATH: "./questdb" condition: eq(variables.vsQuestDbMaster, true) - job: TestsAgainstVariousNumpyVersion1x diff --git a/ci/templates/clone_questdb.yaml b/ci/templates/clone_questdb.yaml new file mode 100644 index 00000000..5b56c362 --- /dev/null +++ b/ci/templates/clone_questdb.yaml @@ -0,0 +1,52 @@ +# Clone questdb master and decide how the questdb-client jar is supplied. +# +# questdb's build pulls in org.questdb:questdb-client at the version held in +# core/pom.xml's questdb.client.version property. When that version is a +# release (e.g. 1.3.0) the jar is fetched from Maven Central, so a plain +# `git clone` is enough. When it's a -SNAPSHOT (e.g. 1.3.5-SNAPSHOT) the jar +# is published nowhere — it must be built from the java-questdb-client +# submodule via questdb's `local-client` profile. Skipping that step fails +# the questdb build with: +# Could not find artifact org.questdb:questdb-client:jar: +# +# This mirrors questdb/questdb's ci/templates/detect-local-client.yml: read +# the version, and for a SNAPSHOT check out the submodule and export +# CLIENT_PROFILE=",local-client" so the "Compile QuestDB master" Maven step +# appends the profile (options: "... 
-Pbuild-web-console$(CLIENT_PROFILE)"). +# For a release CLIENT_PROFILE is cleared, leaving the Maven options +# untouched. Reading the version (rather than hard-coding the profile) keeps +# every questdb-building job working whichever way questdb master flips next. +# +# `condition` parameter: this repo only builds questdb on the +# `vsQuestDbMaster` matrix leg, so callers pass that gate down to every step +# in the template (Azure has no `condition` on a `- template:` reference). +parameters: + - name: condition + type: string + default: succeeded() + +steps: + - script: | + git clone --depth 1 https://github.com/questdb/questdb.git + displayName: git clone questdb + condition: ${{ parameters.condition }} + - bash: | + # No `pipefail`: the version extraction is `sed ... | head -1`, and + # core/pom.xml carries the property on more than one line, so head can + # close the pipe early and leave sed with SIGPIPE — under pipefail that + # would fail the step. `set -eu` still aborts on a failed cd / submodule + # checkout. + set -eu + cd questdb + CLIENT_VERSION=$(sed -n 's/.*<questdb.client.version>\(.*\)<\/questdb.client.version>.*/\1/p' core/pom.xml | head -1) + echo "questdb.client.version=$CLIENT_VERSION" + if echo "$CLIENT_VERSION" | grep -q '\-SNAPSHOT$'; then + echo "SNAPSHOT client -> build java-questdb-client submodule, activate local-client profile" + git submodule update --init --depth 1 java-questdb-client + echo "##vso[task.setvariable variable=CLIENT_PROFILE],local-client" + else + echo "Release client -> resolve questdb-client from Maven Central" + echo "##vso[task.setvariable variable=CLIENT_PROFILE]" + fi + displayName: "Detect local client profile" + condition: ${{ parameters.condition }} diff --git a/docs/api.rst b/docs/api.rst index b3e1f11e..a9efd4c2 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -24,16 +24,56 @@ questdb.ingress :undoc-members: :show-inheritance: +.. autoclass:: questdb.ingress.Client + :members: + :undoc-members: + :show-inheritance: + +.. 
autoclass:: questdb.ingress.QueryResult + :members: + :undoc-members: + :show-inheritance: + .. autoclass:: questdb.ingress.IngressError :members: :undoc-members: :show-inheritance: +.. autoclass:: questdb.ingress.IngressServerRejectionError + :members: + :undoc-members: + :show-inheritance: + +.. autoclass:: questdb.ingress.UnsupportedDataFrameShapeError + :members: + :undoc-members: + :show-inheritance: + .. autoclass:: questdb.ingress.IngressErrorCode :members: :undoc-members: :show-inheritance: +.. autoclass:: questdb.ingress.QwpWsError + :members: + :undoc-members: + :show-inheritance: + +.. autoclass:: questdb.ingress.QwpWsErrorCategory + :members: + :undoc-members: + :show-inheritance: + +.. autoclass:: questdb.ingress.QwpWsErrorPolicy + :members: + :undoc-members: + :show-inheritance: + +.. autoclass:: questdb.ingress.QwpWsProgress + :members: + :undoc-members: + :show-inheritance: + .. autoclass:: questdb.ingress.Protocol :members: :undoc-members: diff --git a/docs/conf.rst b/docs/conf.rst index 1d0eb6c0..49248825 100644 --- a/docs/conf.rst +++ b/docs/conf.rst @@ -29,6 +29,9 @@ The valid protocols are: * ``tcps``: ILP/TCP with TLS * ``http``: ILP/HTTP * ``https``: ILP/HTTP with TLS +* ``qwpudp``: QWP/UDP (QuestWire Protocol over UDP) +* ``qwpws``: QWP/WebSocket +* ``qwpwss``: QWP/WebSocket with TLS If you're unsure which protocol to use, see :ref:`sender_which_protocol`. @@ -57,15 +60,32 @@ Connection ``host:port``. This key-value pair is mandatory, but the port can be defaulted. - If omitted, the port will be defaulted to 9009 for TCP(s) - and 9000 for HTTP(s). + If omitted, the port will be defaulted to 9009 for TCP(s), + 9000 for HTTP(s) and QWP/WebSocket, and 9007 for QWP/UDP. + +* ``bind_interface`` - TCP/QWP-UDP only, ``str``: Network interface to bind + from. Useful if you have an accelerated network interface (e.g. Solarflare) + and want to use it. -* ``bind_interface`` - TCP-only, ``str``: Network interface to bind from. 
- Useful if you have an accelerated network interface (e.g. Solarflare) and - want to use it. - The default is ``0.0.0.0``. +* ``max_datagram_size`` - QWP/UDP-only, ``int > 0``: Maximum UDP datagram + payload size in bytes. + + Default: 1400. + +* ``multicast_ttl`` - QWP/UDP-only, ``int (0-255)``: Multicast TTL + (time-to-live) for UDP datagrams. + + Default: 1. + +* ``qwp_ws_progress`` - QWP/WebSocket-only, ``background`` | ``manual``: + Whether frame acknowledgements are progressed on a background thread + (``background``) or only when the sender's drive/await/flush methods are + called (``manual``). + + Default: ``background``. + .. _sender_conf_auth: Authentication @@ -104,7 +124,7 @@ See the :ref:`auth_and_tls_example` example for more details. TLS === -TLS in enabled by selecting the ``tcps`` or ``https`` protocol. +TLS is enabled by selecting the ``tcps``, ``https``, or ``qwpwss`` protocol. See the `QuestDB enterprise TLS documentation `_ on how to enable this feature in the server. @@ -133,6 +153,12 @@ still use TLS by setting up a proxy in front of QuestDB, such as * ``tls_roots`` - ``str``: Path to a PEM-encoded certificate authority file. When used it defaults the ``tls_ca`` to ``'pem_file'``. + For ``qwpwss``, this can also point at a JKS or PKCS#12 keystore when + paired with ``tls_roots_password``. + +* ``tls_roots_password`` - ``str``: Password for the JKS or PKCS#12 keystore + configured by ``tls_roots``. This is supported only for ``qwpwss``. + * ``tls_verify`` - ``'on'`` | ``'unsafe_off'``: Whether to verify the server's certificate. This should only be used for testing as a last resort and never used in production as it makes the connection vulnerable to man-in-the-middle @@ -170,13 +196,13 @@ The following parameters control the :ref:`sender_auto_flush` behavior. * ``auto_flush_rows`` - ``int > 0`` | ``'off'``: The number of rows that will trigger a flush. Set to ``'off'`` to disable. 
- - *Default: 75000 (HTTP) | 600 (TCP).* + + *Default: 75000 (HTTP) | 600 (TCP, QWP/UDP).* * ``auto_flush_bytes`` - ``int > 0`` | ``'off'``: The number of bytes that will trigger a flush. Set to ``'off'`` to disable. - - Default: ``'off'``. + + *Default: off (TCP, HTTP) | max_datagram_size (QWP/UDP, 1400 by default).* * ``auto_flush_interval`` - ``int > 0`` | ``'off'``: The time in milliseconds that will trigger a flush. Set to ``'off'`` to disable. @@ -228,6 +254,7 @@ Protocol Version ================ Specifies the version of InfluxDB Line Protocol to use. +Not applicable for QWP/UDP senders. Here is a configuration string with ``protocol_version=2`` for ``TCP``:: @@ -281,11 +308,17 @@ The following parameters control the HTTP request behavior. * ``retry_timeout`` - ``int > 0``: The time in milliseconds to continue retrying after a failed HTTP request. The interval between retries is an exponential - backoff starting at 10ms and doubling after each failed attempt up to a - maximum of 1 second. + backoff starting at 10ms and doubling after each failed attempt up to + ``retry_max_backoff_millis``. Default: 10000 (10 seconds). +* ``retry_max_backoff_millis`` - ``int >= 10``: Maximum per-attempt backoff in + milliseconds for the HTTP retry loop. As a ``Sender`` / ``from_conf`` / + ``from_env`` keyword argument this is named ``retry_max_backoff``. + + Default: 1000 (1 second). + * ``request_timeout`` - ``int > 0``: The time in milliseconds to wait for a response from the server. This is in addition to the calculation derived from the ``request_min_throughput`` parameter. diff --git a/docs/examples.rst b/docs/examples.rst index 70e74de3..2ce0f951 100644 --- a/docs/examples.rst +++ b/docs/examples.rst @@ -5,6 +5,42 @@ Examples Basics ====== +.. _qwp_udp_example: + +QWP over UDP +------------ + +The following example sends a row using QuestWire Protocol over UDP. + +Requires a QuestDB instance with QWP/UDP receiver support enabled. The +default listener port is ``9007``. 
+ +.. literalinclude:: ../examples/qwp_udp.py + :language: python + +.. _qwpws_polars_example: + +QWP/WebSocket from Polars +------------------------- + +The following example ingests a Polars ``DataFrame`` over QWP/WebSocket via +:meth:`Client.dataframe`, runs a query into Polars, and includes a +``schema_overrides`` variant. + +.. literalinclude:: ../examples/polars_basic.py + :language: python + +.. _qwpws_pyarrow_example: + +QWP/WebSocket from PyArrow +-------------------------- + +The following example ingests a PyArrow table over QWP/WebSocket via +:meth:`Client.dataframe` and runs a query into PyArrow. + +.. literalinclude:: ../examples/pyarrow_basic.py + :language: python + HTTP with Token Auth -------------------- @@ -64,7 +100,7 @@ Pandas Basics ------------- The following example shows how to insert data from a Pandas DataFrame to the -``'trades'`` table. +``'trades'`` table and run a generated query into Pandas. .. literalinclude:: ../examples/pandas_basic.py :language: python @@ -76,7 +112,8 @@ For details on all options, see the ``pd.Categorical`` and multiple tables -------------------------------------- -The next example shows some more advanced features inserting data from Pandas. +The next example shows some more advanced features inserting data from Pandas +and running a generated query into Pandas. * The data is sent to multiple tables. @@ -99,7 +136,8 @@ For details on all options, see the Loading Pandas from a Parquet File ---------------------------------- -The following example shows how to load a Pandas DataFrame from a Parquet file. +The following example shows how to load a Pandas DataFrame from a Parquet file +and run a generated query into Pandas. The example also relies on the dataframe's index name to determine the table name. 
diff --git a/docs/installation.rst b/docs/installation.rst index dc2f0405..b83d4d81 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -6,55 +6,46 @@ Dependency ========== The Python QuestDB client does not have any additional run-time dependencies and -will run on any version of Python >= 3.9 on most platforms and architectures. +will run on any version of Python >= 3.10 on most platforms and architectures. From version 3.0.0, this library depends on ``numpy>=1.21.0``. Optional Dependencies --------------------- -Ingesting dataframes also require the following -dependencies to be installed: +The ``dataframe`` extra bundles ``pandas`` and ``pyarrow``: -* ``pandas`` -* ``pyarrow`` +* ``dataframe`` → ``pandas`` and ``pyarrow`` -These are bundled as the ``dataframe`` extra. +Install it to ingest a **pandas** DataFrame, or to use the +``to_pandas`` / ``to_arrow`` / ``iter_*`` helpers on ``Client.query()`` +results. polars, pyarrow, duckdb and any other Arrow-native source need +no extra — they go through the Arrow PyCapsule Interface; just install +the source library as usual. -Without this option, you may still ingest data row-by-row. +Without it, you may still ingest data row-by-row through +``Sender.row()`` and ``Buffer.row()``, and read query results through +the ``__arrow_c_stream__`` PyCapsule protocol. 
PIP --- -You can install it (or update it) globally by running:: +DataFrame ingest (pandas + pyarrow):: python3 -m pip install -U questdb[dataframe] +Row-only:: -Or, from within a virtual environment:: - - pip install -U questdb[dataframe] - - -If you don't need to work with dataframes:: - python3 -m pip install -U questdb Poetry ------ -If you're using poetry, you can add ``questdb`` as a dependency:: +Equivalents for poetry:: poetry add questdb[dataframe] - -Similarly, if you don't need to work with dataframes:: - poetry add questdb -or to update the dependency:: - - poetry update questdb - Verifying the Installation ========================== @@ -65,13 +56,13 @@ following statements from a ``python3`` interactive shell: .. code-block:: python >>> import questdb.ingress - >>> buf = questdb.ingress.Buffer() - >>> buf.row('test', symbols={'a': 'b'}) + >>> buf = questdb.ingress.Buffer.ilp() + >>> buf.row('test', symbols={'a': 'b'}, columns={'x': 1}, at=questdb.ingress.ServerTimestamp) - >>> str(buf) - 'test,a=b\n' + >>> bytes(buf) + b'test,a=b x=1i\n' -If you also want to if check you can serialize from Pandas +If you also want to check you can serialize from Pandas (which requires additional dependencies): .. 
code-block:: python @@ -79,7 +70,8 @@ If you also want to if check you can serialize from Pandas >>> import questdb.ingress >>> import pandas as pd >>> df = pd.DataFrame({'a': [1, 2]}) - >>> buf = questdb.ingress.Buffer() - >>> buf.dataframe(df, table_name='test') - >>> str(buf) - 'test a=1i\ntest a=2i\n' + >>> buf = questdb.ingress.Buffer.ilp() + >>> buf.dataframe(df, table_name='test', at=questdb.ingress.ServerTimestamp) + + >>> bytes(buf) + b'test a=1i\ntest a=2i\n' diff --git a/docs/sender.rst b/docs/sender.rst index 5bdb71d6..e757c194 100644 --- a/docs/sender.rst +++ b/docs/sender.rst @@ -9,9 +9,9 @@ Overview The :class:`Sender ` class is a client that inserts rows into QuestDB via the -`ILP protocol `_, with -support for both ILP over TCP and the newer and recommended ILP over HTTP. -The sender also supports TLS and authentication. +`ILP protocol `_ (TCP +and HTTP) or via QWP/UDP for fire-and-forget, lowest-latency ingestion. +The sender also supports TLS and authentication (ILP only). .. code-block:: python @@ -462,7 +462,8 @@ Prefer ILP/HTTP --------------- Use the ILP/HTTP protocol instead of ILP/TCP for better error reporting and -transaction control. +transaction control. Use QWP/UDP only when you need fire-and-forget, +lowest-latency ingestion and can tolerate potential data loss. .. _sender_tips_connection_reuse: @@ -540,11 +541,16 @@ serialisation logic from the sending logic. Note that the sender's auto-flushing logic will not apply to independent buffers. +You can create a standalone buffer with :func:`Buffer.ilp` (for ILP senders) +or :func:`Buffer.qwp` (for QWP/UDP senders). Alternatively, call +:func:`Sender.new_buffer` which creates the correct buffer type matching the +sender's protocol. + .. 
code-block:: python from questdb.ingress import Buffer, Sender, TimestampNanos - buf = Buffer() + buf = Buffer.ilp(protocol_version=2) buf.row( 'trades', symbols={'symbol': 'ETH-USD', 'side': 'sell'}, @@ -576,7 +582,7 @@ databases via the ``.flush(buf, clear=False)`` option. from questdb.ingress import Buffer, Sender, TimestampNanos - buf = Buffer() + buf = Buffer.ilp(protocol_version=2) buf.row( 'trades', symbols={'symbol': 'ETH-USD', 'side': 'sell'}, @@ -835,13 +841,118 @@ See the :ref:`sender_conf_protocol_version` section for more details. .. _sender_which_protocol: -ILP/TCP or ILP/HTTP -=================== +Which protocol? +=============== + +The sender supports ``tcp``, ``tcps``, ``http``, ``https``, ``qwpudp``, +``qwpws``, and ``qwpwss`` protocols. + +**You should prefer to use ILP/HTTP in most cases as it provides better +feedback on errors and transaction control.** + +.. _sender_qwp_udp: + +QWP/UDP +------- + +QWP/UDP (``qwpudp``) uses fire-and-forget UDP datagrams for lowest-latency +ingestion. It does not support authentication, TLS, or transactions. The +default port is 9007. See the :ref:`qwp_udp_example` example. + +Key differences from ILP: + +* **No delivery guarantee.** UDP datagrams may be dropped under load or network + congestion. There is no retry mechanism and the server sends no + acknowledgement. Use ILP/HTTP if you need reliable delivery. -The sender supports ``tcp``, ``tcps``, ``http``, and ``https`` protocols. +* **No error feedback.** If a row contains invalid data (e.g. wrong column type + for an existing table), the server silently drops it. With ILP/HTTP you would + get an error response. -**You should prefer to use the new ILP/HTTP protocol instead of ILP/TCP in most -cases as it provides better feedback on errors and transaction control.** +* **Buffer inspection.** ``bytes(sender)`` returns ``b''`` because QWP encoding + is deferred to flush. ``len(sender)`` returns an estimated size hint, not the + exact serialized byte count. 
+ +* **Standalone buffers.** Use :func:`Buffer.qwp` (not :func:`Buffer.ilp`) to + create standalone QWP buffers. Alternatively, use :func:`Sender.new_buffer` + which creates the correct buffer type automatically. + +* **Auto-flush.** ``auto_flush_bytes`` defaults to ``max_datagram_size`` (1400 + by default) so that rows are flushed when the buffer approaches a single + datagram's worth of data. Rows and interval thresholds work the same as ILP. + +* **Datagram size limit.** A single row that exceeds ``max_datagram_size`` will + raise :class:`IngressError` at flush time. Configure ``max_datagram_size`` via + the constructor or :ref:`configuration string `. + +* **No protocol version.** QWP has its own versioning. The ``protocol_version`` + parameter and property are not applicable and will raise an error. + +.. _sender_qwp_ws: + +QWP/WebSocket +------------- + +QWP/WebSocket (``qwpws``, or ``qwpwss`` for TLS) is an acknowledged streaming +transport. Each flush publishes a frame identified by a monotonically +increasing **frame sequence number (FSN)**; the server acknowledges frames as +it durably applies them, so the client can confirm delivery. + +* **Confirming delivery.** :func:`Sender.flush_and_get_fsn` flushes and returns + the FSN of the published frame; :func:`Sender.flush_and_keep_and_get_fsn` + does the same without clearing the buffer. :func:`Sender.await_acked_fsn` + blocks until a given FSN is acknowledged (or a timeout elapses), and + :func:`Sender.acked_fsn` / :func:`Sender.published_fsn` report progress + without blocking. + +* **Progress modes.** With the default ``qwp_ws_progress=background``, + acknowledgements are progressed on a background thread. With + ``qwp_ws_progress=manual``, the application must call + :func:`Sender.drive_once` (or one of the flush/await methods) to pump the + connection. 
+ +* **Server diagnostics.** Per-frame server feedback is delivered to the + ``qwp_ws_error_handler`` callback, or polled via + :func:`Sender.poll_qwp_ws_error` as :class:`QwpWsError` values + (:func:`Sender.qwp_ws_errors_dropped` reports how many were dropped when no + handler kept up). A diagnostic with a ``halt`` policy is terminal: the next + sender call raises :class:`IngressServerRejectionError`. + +* **Draining on close.** :func:`Sender.close_drain` waits for outstanding + frames to be acknowledged before closing. + +* **Standalone buffers.** As with QWP/UDP, use :func:`Buffer.qwp` or + :func:`Sender.new_buffer`. + +.. _query_egress: + +Querying data +============= + +:class:`Client` reads query results back over the QWP/WebSocket read endpoint. +:func:`Client.query` returns a single-use :class:`QueryResult` that streams rows +as Arrow record batches:: + + with qi.Client.from_conf('qwpws::addr=localhost:9000;') as client: + with client.query('SELECT * FROM trades WHERE ts > $1') as result: + df = result.to_pandas() + +A :class:`QueryResult` can be materialised with ``to_arrow`` / ``to_pandas`` or +streamed batch-by-batch with ``iter_arrow`` / ``iter_pandas``. ``to_arrow`` / +``iter_arrow`` (and ``to_pandas`` / ``iter_pandas`` with ``dtype_backend`` or +``types_mapper``) require pyarrow; the default ``to_pandas`` / ``iter_pandas`` +are pyarrow-free. It also implements the Arrow C stream PyCapsule protocol +(``__arrow_c_stream__``), so ``polars.from_arrow(result)`` or +``duckdb.from_arrow(result)`` consume it directly without pyarrow installed. +Each result is consumed once; call :func:`QueryResult.cancel` to ask the server +to stop streaming and :func:`QueryResult.close` to release resources. + +The same :class:`Client` can ingest dataframes through the pooled columnar QWP +path with :func:`Client.dataframe`. Adding ``sf_dir=...`` to +:func:`Client.from_conf` opts dataframe ingestion into the Rust +store-and-forward column-sender backend. 
The dataframe method still waits for +``AckLevel::Ok`` before returning; only lower-level columnar flush APIs return +after local queue acceptance. ILP/HTTP is available from: @@ -862,6 +973,9 @@ auto-detection. | | ``protocol_version=N`` to to match a version supported by | | | the server. | +----------------+--------------------------------------------------------------+ +| QWP/UDP | **N/A**: QWP uses its own wire format. The | +| | ``protocol_version`` setting is not applicable. | ++----------------+--------------------------------------------------------------+ .. note:: @@ -880,6 +994,7 @@ Either way, you can easily switch between the two protocols by changing: * The ```` part of the :ref:`configuration string `. -* The port number (ILP/TCP default is 9009, ILP/HTTP default is 9000). +* The port number (ILP/TCP default is 9009, ILP/HTTP default is 9000, + QWP/UDP default is 9007). * Any :ref:`authentication parameters ` such as ``username``, ``token``, et cetera. diff --git a/examples.manifest.yaml b/examples.manifest.yaml index e47f8b5b..83ae473f 100644 --- a/examples.manifest.yaml +++ b/examples.manifest.yaml @@ -11,6 +11,22 @@ ``` python3 -m pip install -U questdb ``` +- name: qwpudp + lang: python + path: examples/qwp_udp.py + header: |- + Python client library [docs](https://py-questdb-client.readthedocs.io/en/latest/) + and [repo](https://github.com/questdb/py-questdb-client). + + See more [examples](https://py-questdb-client.readthedocs.io/en/latest/examples.html), + including ingesting data from Pandas dataframes. 
+ + ``` + python3 -m pip install -U questdb + ``` + addr: + host: localhost + port: 9007 - name: ilp-auth lang: python path: examples/auth.py @@ -61,4 +77,31 @@ python3 -m pip install -U questdb ``` conf: http::addr=localhost:9000; +- name: qwpws-polars + lang: python + path: examples/polars_basic.py + header: |- + Python client library [docs](https://py-questdb-client.readthedocs.io/en/latest/) + and [repo](https://github.com/questdb/py-questdb-client). + + See more [examples](https://py-questdb-client.readthedocs.io/en/latest/examples.html), + including ingesting data from Pandas dataframes. + + ``` + python3 -m pip install -U 'questdb[dataframe]' polars + ``` + conf: qwpws::addr=localhost:9000; +- name: qwpws-pyarrow + lang: python + path: examples/pyarrow_basic.py + header: |- + Python client library [docs](https://py-questdb-client.readthedocs.io/en/latest/) + and [repo](https://github.com/questdb/py-questdb-client). + + See more [examples](https://py-questdb-client.readthedocs.io/en/latest/examples.html), + including ingesting data from Pandas dataframes. + ``` + python3 -m pip install -U 'questdb[dataframe]' + ``` + conf: qwpws::addr=localhost:9000; diff --git a/examples/pandas_advanced.py b/examples/pandas_advanced.py index 1e3215e6..a844176a 100644 --- a/examples/pandas_advanced.py +++ b/examples/pandas_advanced.py @@ -1,4 +1,4 @@ -from questdb.ingress import Sender, IngressError +from questdb.ingress import Client, Sender, IngressError import sys import pandas as pd @@ -19,12 +19,22 @@ def example(host: str = 'localhost', port: int = 9000): pd.Timestamp('2022-08-06 07:35:23.189062')]}) try: with Sender.from_conf(f"http::addr={host}:{port};") as sender: + # Ingress: publish a Pandas DataFrame into QuestDB. sender.dataframe( df, table_name_col='metric', # Table name from 'metric' column. symbols='auto', # Category columns as SYMBOL. (Default) at=-1) # Last column contains the designated timestamps. 
+ with Client.from_conf(f"qwpws::addr={host}:{port};") as client: + # Egress: query QuestDB and materialise the result as Pandas. + with client.query( + "SELECT x AS sample_id, " + "x / 10.0 AS value " + "FROM long_sequence(3)") as result: + queried = result.to_pandas() + print(queried) + except IngressError as e: sys.stderr.write(f'Got error: {e}\n') diff --git a/examples/pandas_basic.py b/examples/pandas_basic.py index ebb3c7ee..04338606 100644 --- a/examples/pandas_basic.py +++ b/examples/pandas_basic.py @@ -1,4 +1,4 @@ -from questdb.ingress import Sender, IngressError +from questdb.ingress import Client, Sender, IngressError import sys import pandas as pd @@ -13,12 +13,22 @@ def example(host: str = 'localhost', port: int = 9000): 'timestamp': pd.to_datetime(['2021-01-01', '2021-01-02'])}) try: with Sender.from_conf(f"http::addr={host}:{port};") as sender: + # Ingress: publish a Pandas DataFrame into QuestDB. sender.dataframe( df, table_name='trades', # Table name to insert into. symbols=['symbol', 'side'], # Columns to be inserted as SYMBOL types. at='timestamp') # Column containing the designated timestamps. + with Client.from_conf(f"qwpws::addr={host}:{port};") as client: + # Egress: query QuestDB and materialise the result as Pandas. + with client.query( + "SELECT x AS trade_id, " + "x * 10.0 AS price " + "FROM long_sequence(3)") as result: + queried = result.to_pandas() + print(queried) + except IngressError as e: sys.stderr.write(f'Got error: {e}\n') diff --git a/examples/pandas_parquet.py b/examples/pandas_parquet.py index 54af4e4d..16e1bce7 100644 --- a/examples/pandas_parquet.py +++ b/examples/pandas_parquet.py @@ -1,4 +1,4 @@ -from questdb.ingress import Sender +from questdb.ingress import Client, Sender import pandas as pd @@ -35,9 +35,19 @@ def example(host: str = 'localhost', port: int = 9000): df = pd.read_parquet(filename) with Sender.from_conf(f"http::addr={host}:{port};") as sender: + # Ingress: publish a Pandas DataFrame into QuestDB. 
# Note: Table name is looked up from the dataframe's index name. sender.dataframe(df, at='ts') + with Client.from_conf(f"qwpws::addr={host}:{port};") as client: + # Egress: query QuestDB and materialise the result as Pandas. + with client.query( + "SELECT x AS charger_id, " + "x * 25 AS speed_kwh " + "FROM long_sequence(3)") as result: + queried = result.to_pandas() + print(queried) + if __name__ == '__main__': example() diff --git a/examples/polars_basic.py b/examples/polars_basic.py new file mode 100644 index 00000000..bba24dce --- /dev/null +++ b/examples/polars_basic.py @@ -0,0 +1,86 @@ +"""Polars DataFrame ingest and query example. + +`Client.dataframe()` accepts polars `DataFrame` and `LazyFrame` directly, +riding the Arrow PyCapsule Interface (`__arrow_c_stream__`) straight into +`column_sender_flush_arrow_batch`. `Client.query()` can materialise query +results as a polars `DataFrame` with `QueryResult.to_polars()`. +""" + +from questdb.ingress import Client, IngressError +import datetime +import sys + + +def example(host: str = 'localhost', port: int = 9000): + import polars as pl + + df = pl.DataFrame({ + 'symbol': ['ETH-USD', 'BTC-USD', 'ETH-USD'], + 'side': ['sell', 'buy', 'buy'], + 'price': [2615.54, 67234.12, 2620.88], + 'amount': [0.00044, 0.0012, 0.00033], + 'ts': [ + datetime.datetime(2025, 1, 1, 12, 0, 0), + datetime.datetime(2025, 1, 1, 12, 0, 1), + datetime.datetime(2025, 1, 1, 12, 0, 2), + ], + }) + + try: + conf = f'qwpws::addr={host}:{port};' + with Client.from_conf(conf) as client: + # Ingress: publish a Polars DataFrame into QuestDB. + client.dataframe(df, table_name='trades', at='ts') + + client.dataframe( + df, + table_name='trades_chunked', + at='ts', + max_rows_per_batch=2) + + # Egress: query QuestDB and materialise the result as Polars. 
+ with client.query( + "SELECT x AS trade_id, " + "x * 10.0 AS price, " + "timestamp_sequence(" + "'2025-01-01T12:00:00.000000Z', 1000000) AS ts " + "FROM long_sequence(3)") as result: + queried = result.to_polars() + print(queried) + except IngressError as e: + sys.stderr.write(f'Got error: {e}\n') + + +def schema_overrides_example(host: str = 'localhost', port: int = 9000): + """B-class wire types (IPv4 / Geohash / etc.) need an explicit hint + because `arr.dtype` alone cannot disambiguate them from plain + integer columns. `schema_overrides` injects the corresponding + `questdb.*` Arrow Field metadata. Requires pyarrow. + """ + import polars as pl + + df = pl.DataFrame({ + 'addr': [0x0A000001, 0xC0A80101, 0x7F000001], + 'price': [100, 200, 300], + 'ts': [ + datetime.datetime(2025, 1, 1, 12, 0, 0), + datetime.datetime(2025, 1, 1, 12, 0, 1), + datetime.datetime(2025, 1, 1, 12, 0, 2), + ], + }, schema={'addr': pl.UInt32, 'price': pl.Int64, 'ts': pl.Datetime('us')}) + + try: + conf = f'qwpws::addr={host}:{port};' + with Client.from_conf(conf) as client: + client.dataframe( + df, + table_name='ipv4_log', + at='ts', + schema_overrides={'addr': 'ipv4'}) + except IngressError as e: + sys.stderr.write(f'Got error: {e}\n') + + +if __name__ == '__main__': + example() + schema_overrides_example() diff --git a/examples/pyarrow_basic.py b/examples/pyarrow_basic.py new file mode 100644 index 00000000..9f165292 --- /dev/null +++ b/examples/pyarrow_basic.py @@ -0,0 +1,83 @@ +"""PyArrow Table / RecordBatch ingest and query example. + +`Client.dataframe()` accepts any object exposing the Arrow PyCapsule +Interface (`__arrow_c_stream__`) — pyarrow Table, RecordBatch, DuckDB +relations, cudf, etc. — and routes through +`column_sender_flush_arrow_batch` one-shot. No per-column Cython +dispatch, no chunk lifecycle. `Client.query()` can materialise query +results as a pyarrow `Table` with `QueryResult.to_arrow()`. 
+""" + +from questdb.ingress import Client, IngressError +import sys + + +def example(host: str = 'localhost', port: int = 9000): + import pyarrow as pa + + schema = pa.schema([ + pa.field('symbol', pa.string()), + pa.field('side', pa.string()), + pa.field('price', pa.float64()), + pa.field('amount', pa.float64()), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 'symbol': ['ETH-USD', 'BTC-USD'], + 'side': ['sell', 'buy'], + 'price': [2615.54, 67234.12], + 'amount': [0.00044, 0.0012], + 'ts': [1735732800_000_000, 1735732801_000_000], + }, schema=schema) + + try: + conf = f'qwpws::addr={host}:{port};' + with Client.from_conf(conf) as client: + # Ingress: publish a PyArrow Table into QuestDB. + client.dataframe(table, table_name='trades', at='ts') + + # Egress: query QuestDB and materialise the result as PyArrow. + with client.query( + "SELECT x AS trade_id, " + "x * 10.0 AS price, " + "timestamp_sequence(" + "'2025-01-01T12:00:00.000000Z', 1000000) AS ts " + "FROM long_sequence(3)") as result: + queried = result.to_arrow() + print(queried) + except IngressError as e: + sys.stderr.write(f'Got error: {e}\n') + + +def schema_metadata_example(host: str = 'localhost', port: int = 9000): + """B-class wire types can be selected either via `schema_overrides` + (wrapper injects metadata for you) or by attaching the metadata + directly on the pyarrow Field. Both are equivalent; this example + shows the direct-attach form. 
+ """ + import pyarrow as pa + + schema = pa.schema([ + pa.field('addr', pa.uint32(), + metadata={b'questdb.column_type': b'ipv4'}), + pa.field('loc', pa.int32(), + metadata={b'questdb.geohash_bits': b'20'}), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 'addr': [0x0A000001, 0xC0A80101], + 'loc': [0x12345, 0x67890], + 'ts': [1735732800_000_000, 1735732801_000_000], + }, schema=schema) + + try: + conf = f'qwpws::addr={host}:{port};' + with Client.from_conf(conf) as client: + client.dataframe(table, table_name='locations', at='ts') + except IngressError as e: + sys.stderr.write(f'Got error: {e}\n') + + +if __name__ == '__main__': + example() + schema_metadata_example() diff --git a/examples/qwp_udp.py b/examples/qwp_udp.py new file mode 100644 index 00000000..a303b1e5 --- /dev/null +++ b/examples/qwp_udp.py @@ -0,0 +1,35 @@ +from questdb.ingress import Sender, Protocol, IngressError, TimestampNanos +import sys + + +def example( + host: str = 'localhost', + port: int = 9007, + table_name: str = 'trades'): + try: + with Sender( + Protocol.QwpUdp, + host, + port, + max_datagram_size=1400) as sender: + sender.row( + table_name, + symbols={ + 'symbol': 'ETH-USD', + 'side': 'sell'}, + columns={ + 'price': 2615.54, + 'amount': 0.00044, + }, + at=TimestampNanos.now()) + + # QWP/UDP defaults `auto_flush_bytes` to the datagram size. + # Flush manually here to send the row immediately. + sender.flush() + + except IngressError as e: + sys.stderr.write(f'Got error: {e}\n') + + +if __name__ == '__main__': + example() diff --git a/proj.py b/proj.py index 2f27c966..be7a057d 100755 --- a/proj.py +++ b/proj.py @@ -125,6 +125,22 @@ def benchmark(*args): _run('python3', 'test/benchmark.py', '-v', *args, env=env) +@command +def pandas_to_questdb_throughput(*args): + """WS-7 headline ingress run (QWP_DATAFRAME_BENCH_PLAN.md s4). 
+ + Runs the s1-narrow columnar-populate floor + the cold/warm e2e split + (in-process mock server) + the populate_plus_encode sum. Pass extra args + through to the harness, e.g. ``--rows 10000000 --pretty`` or + ``--real-conf qwpws::addr=... --real-http http://...`` to add the + live-server real-client number. Ack level is Ok; Durable is Enterprise and + deferred. + """ + env = {'TEST_QUESTDB_PATCH_PATH': '1'} + _run('python3', 'test/benchmark_pandas_columnar.py', '--headline', + '--schema', 's1-narrow', *args, env=env) + + @command def gdb_test(*args): env = {'TEST_QUESTDB_PATCH_PATH': '1', 'PYTHONMALLOC': 'malloc'} diff --git a/pyproject.toml b/pyproject.toml index 48d8f427..44bff075 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -29,7 +29,7 @@ email = "adam@questdb.com" [project.optional-dependencies] publish = ["twine", "wheel"] ci = ["cibuildwheel"] -dataframe = ["pandas", "pyarrow", "numpy"] +dataframe = ["pandas>=1.3.5", "pyarrow>=10.0.1"] [project.urls] Homepage = "https://questdb.com/" diff --git a/setup.py b/setup.py index 74438319..59496db1 100755 --- a/setup.py +++ b/setup.py @@ -94,7 +94,9 @@ def ingress_extension(): extra_objects=extra_objects, depends=depends, define_macros = [ - ('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION') + ('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION'), + ('QUESTDB_CLIENT_HAS_ARROW', '1'), + ('QUESTDB_CLIENT_ENABLE_ARROW', '1'), ] ) @@ -146,7 +148,8 @@ def cargo_build(): else: del env['CXX'] subprocess.check_call( - cargo_args + ['--features', 'confstr-ffi'], + cargo_args + ['--features', + 'confstr-ffi,insecure-skip-verify,sync-reader-ws,arrow'], cwd=str(PROJ_ROOT / 'c-questdb-client' / 'questdb-rs-ffi'), env=env) @@ -175,11 +178,10 @@ def readme(): name='questdb', version='4.1.0', platforms=['any'], - python_requires='>=3.8', - install_requires=[], + python_requires='>=3.10', + install_requires=['numpy>=1.21.0'], ext_modules = cythonize([ingress_extension()], annotate=True), cmdclass={'build_ext': questdb_build_ext}, 
zip_safe = False, package_dir={'': 'src'}, - test_suite="tests", - packages=find_packages('src', exclude=['test'])) + packages=find_packages('src')) diff --git a/src/questdb/arrow_c_data_interface.pxd b/src/questdb/arrow_c_data_interface.pxd index 8c0b5472..adbf6eef 100644 --- a/src/questdb/arrow_c_data_interface.pxd +++ b/src/questdb/arrow_c_data_interface.pxd @@ -36,3 +36,10 @@ cdef extern from "arrow_c_data_interface.h": void (*release)(ArrowArray*) # Opaque producer-specific data void* private_data + + cdef struct ArrowArrayStream: + int (*get_schema)(ArrowArrayStream*, ArrowSchema* out) noexcept + int (*get_next)(ArrowArrayStream*, ArrowArray* out) noexcept + const char* (*get_last_error)(ArrowArrayStream*) noexcept + void (*release)(ArrowArrayStream*) noexcept + void* private_data diff --git a/src/questdb/dataframe.pxi b/src/questdb/dataframe.pxi index 6702e83f..cd758e11 100644 --- a/src/questdb/dataframe.pxi +++ b/src/questdb/dataframe.pxi @@ -1,6 +1,8 @@ # See: dataframe.md for technical overview. from decimal import Decimal +import ipaddress as _ipaddress +import uuid as _uuid from cpython.bytes cimport PyBytes_AsString from .mpdecimal_compat cimport decimal_pyobj_to_binary @@ -81,6 +83,7 @@ cdef struct col_cursor_t: ArrowArray* chunk # Current chunk. size_t chunk_index size_t offset # i.e. the element index (not byte offset) + bint dictionary_large_offsets cdef enum col_target_t: @@ -95,6 +98,22 @@ cdef enum col_target_t: col_target_column_arr_f64 = 8 col_target_column_decimal = 9 col_target_at = 10 + # Narrow numeric targets used by the column-QWP path only. Each + # maps to a dedicated wire type (BYTE / SHORT / INT / FLOAT) + # instead of widening to LONG / DOUBLE. Selected by + # `_FIELD_TARGETS_QWP`, which puts these ahead of the wide + # targets so the resolver picks them for Arrow narrow sources; + # row-ILP uses `_FIELD_TARGETS_ROW`, which does not list them. 
+ col_target_column_i8 = 11 + col_target_column_i16 = 12 + col_target_column_i32 = 13 + col_target_column_f32 = 14 + col_target_column_uuid = 15 + col_target_column_long256 = 16 + col_target_column_ipv4 = 17 + col_target_column_binary = 18 + # Generic Arrow field passthrough to the Rust importer; column-QWP only. + col_target_column_arrow = 19 cdef dict _TARGET_NAMES = { @@ -109,6 +128,15 @@ cdef dict _TARGET_NAMES = { col_target_t.col_target_column_arr_f64: "array", col_target_t.col_target_column_decimal: "decimal", col_target_t.col_target_at: "designated timestamp", + col_target_t.col_target_column_i8: "byte", + col_target_t.col_target_column_i16: "short", + col_target_t.col_target_column_i32: "int", + col_target_t.col_target_column_f32: "float32", + col_target_t.col_target_column_uuid: "uuid", + col_target_t.col_target_column_long256: "long256", + col_target_t.col_target_column_ipv4: "ipv4", + col_target_t.col_target_column_binary: "binary", + col_target_t.col_target_column_arrow: "arrow", } @@ -150,12 +178,30 @@ cdef enum col_source_t: col_source_dt64ns_tz_arrow = 502000 col_source_dt64us_numpy = 601000 col_source_dt64us_tz_arrow = 602000 + # Designated-`at` only (columnar): widened to micros in Rust by the + # millis/seconds designated-timestamp FFI. + col_source_dt64ms_tz_arrow = 603000 + col_source_dt64s_tz_arrow = 604000 col_source_arr_f64_numpyobj = 701100 col_source_decimal_pyobj = 801100 col_source_decimal32_arrow = 802000 col_source_decimal64_arrow = 803000 col_source_decimal128_arrow = 804000 col_source_decimal256_arrow = 805000 + # FixedSizeBinary(16) — the canonical Arrow shape egress emits + # for UUID columns (with or without the `arrow.uuid` extension + # wrapper, which we strip on input). Column-QWP only; row-ILP + # has no serializer for this source. + col_source_fsb16_arrow = 901000 + # FixedSizeBinary(32) — the canonical shape egress emits for + # LONG256 columns. Column-QWP only. 
+ col_source_fsb32_arrow = 902000 + # PyObject sniff outputs for QuestDB-specific wire kinds. + col_source_uuid_pyobj = 903100 + col_source_ipv4_pyobj = 904100 + col_source_datetime_pyobj = 905100 + col_source_bytes_pyobj = 906100 + col_source_arrow_passthrough = 1000000 cdef bint col_source_needs_gil(col_source_t source) noexcept nogil: @@ -179,9 +225,16 @@ cdef dict _PYOBJ_SOURCE_DESCR = { col_source_t.col_source_float_pyobj: "float", col_source_t.col_source_str_pyobj: "str", col_source_t.col_source_decimal_pyobj: "Decimal", + col_source_t.col_source_uuid_pyobj: "UUID", + col_source_t.col_source_ipv4_pyobj: "IPv4Address", + col_source_t.col_source_datetime_pyobj: "datetime", + col_source_t.col_source_bytes_pyobj: "bytes", } +# Compatibility matrix for the Python dataframe planner. `Client.dataframe()` +# uses the Rust Arrow RecordBatch route as the canonical Arrow policy when its +# public routing constraints are satisfied. cdef dict _TARGET_TO_SOURCES = { col_target_t.col_target_skip: { col_source_t.col_source_nulls, @@ -233,6 +286,38 @@ cdef dict _TARGET_TO_SOURCES = { col_source_t.col_source_f32_arrow, col_source_t.col_source_f64_arrow, }, + col_target_t.col_target_column_i8: { + col_source_t.col_source_i8_arrow, + }, + col_target_t.col_target_column_i16: { + col_source_t.col_source_i16_arrow, + }, + col_target_t.col_target_column_i32: { + col_source_t.col_source_i32_arrow, + }, + col_target_t.col_target_column_f32: { + col_source_t.col_source_f32_arrow, + }, + col_target_t.col_target_column_uuid: { + col_source_t.col_source_fsb16_arrow, + col_source_t.col_source_uuid_pyobj, + }, + col_target_t.col_target_column_long256: { + col_source_t.col_source_fsb32_arrow, + }, + col_target_t.col_target_column_binary: { + col_source_t.col_source_bytes_pyobj, + }, + col_target_t.col_target_column_arrow: { + col_source_t.col_source_arrow_passthrough, + }, + # The Rust Arrow path treats UInt32 as IPV4 only when Arrow field + # metadata says questdb.column_type=ipv4. 
Pandas drops Arrow field + # metadata before it reaches this planner, so plain UInt32 must + # resolve through col_target_column_i64 instead. + col_target_t.col_target_column_ipv4: { + col_source_t.col_source_ipv4_pyobj, + }, col_target_t.col_target_column_str: { col_source_t.col_source_str_pyobj, col_source_t.col_source_str_utf8_arrow, @@ -245,7 +330,8 @@ cdef dict _TARGET_TO_SOURCES = { col_source_t.col_source_dt64ns_numpy, col_source_t.col_source_dt64ns_tz_arrow, col_source_t.col_source_dt64us_numpy, - col_source_t.col_source_dt64us_tz_arrow + col_source_t.col_source_dt64us_tz_arrow, + col_source_t.col_source_datetime_pyobj, }, col_target_t.col_target_column_arr_f64: { col_source_t.col_source_arr_f64_numpyobj, @@ -262,12 +348,26 @@ cdef dict _TARGET_TO_SOURCES = { col_source_t.col_source_dt64ns_tz_arrow, col_source_t.col_source_dt64us_numpy, col_source_t.col_source_dt64us_tz_arrow, + col_source_t.col_source_dt64ms_tz_arrow, + col_source_t.col_source_dt64s_tz_arrow, + col_source_t.col_source_datetime_pyobj, }, } -# Targets associated with col_meta_target.field. -cdef tuple _FIELD_TARGETS = ( +# Field-target orderings used by `_dataframe_resolve_target` — each +# protocol passes its own ordering so the resolver picks the right +# target on the first hit. +# +# Many Arrow sources sit in multiple targets' source-sets +# (`_TARGET_TO_SOURCES`) on purpose, e.g. `col_source_i8_arrow` lives +# in both `col_target_column_i64` (so row-ILP can serialize it as +# text via the existing i64 dispatch) and `col_target_column_i8` (so +# column-QWP can send it as a BYTE wire type). The two `_FIELD_TARGETS_*` +# tuples disambiguate: row-ILP lists wide targets only; column-QWP +# lists narrow targets first so they win the resolver loop. 
+ +cdef tuple _FIELD_TARGETS_ROW = ( col_target_t.col_target_skip, col_target_t.col_target_column_bool, col_target_t.col_target_column_i64, @@ -277,6 +377,35 @@ cdef tuple _FIELD_TARGETS = ( col_target_t.col_target_column_arr_f64, col_target_t.col_target_column_decimal) +cdef tuple _FIELD_TARGETS_QWP = ( + col_target_t.col_target_skip, + col_target_t.col_target_column_bool, + # Narrow numeric targets first — they own the Arrow narrow + # sources (`i8_arrow`, `f32_arrow`, …) so column-QWP emits the + # corresponding narrow wire types (BYTE / SHORT / INT / FLOAT) + # instead of widening to LONG / DOUBLE. + col_target_t.col_target_column_i8, + col_target_t.col_target_column_i16, + col_target_t.col_target_column_i32, + col_target_t.col_target_column_i64, + # IPV4 remains a column-QWP target for the wire emitter, but this + # pandas planner has no metadata-preserving UInt32 source that can + # select it. Metadata-aware IPV4 routing belongs to the Rust Arrow + # ingestion path. + col_target_t.col_target_column_ipv4, + col_target_t.col_target_column_f32, + col_target_t.col_target_column_f64, + col_target_t.col_target_column_str, + col_target_t.col_target_column_ts, + col_target_t.col_target_column_arr_f64, + col_target_t.col_target_column_decimal, + # QuestDB-extension types whose Arrow source is unique + # (FixedSizeBinary widths). + col_target_t.col_target_column_uuid, + col_target_t.col_target_column_long256, + col_target_t.col_target_column_binary, + col_target_t.col_target_column_arrow) + # Targets that map directly from a meta target. 
cdef set _DIRECT_META_TARGETS = { @@ -422,6 +551,31 @@ cdef enum col_dispatch_code_t: col_dispatch_code_column_decimal__decimal256_arrow = \ col_target_t.col_target_column_decimal + col_source_t.col_source_decimal256_arrow + col_dispatch_code_column_i8__i8_arrow = \ + col_target_t.col_target_column_i8 + col_source_t.col_source_i8_arrow + col_dispatch_code_column_i16__i16_arrow = \ + col_target_t.col_target_column_i16 + col_source_t.col_source_i16_arrow + col_dispatch_code_column_i32__i32_arrow = \ + col_target_t.col_target_column_i32 + col_source_t.col_source_i32_arrow + col_dispatch_code_column_f32__f32_arrow = \ + col_target_t.col_target_column_f32 + col_source_t.col_source_f32_arrow + col_dispatch_code_column_uuid__fsb16_arrow = \ + col_target_t.col_target_column_uuid + col_source_t.col_source_fsb16_arrow + col_dispatch_code_column_uuid__uuid_pyobj = \ + col_target_t.col_target_column_uuid + col_source_t.col_source_uuid_pyobj + col_dispatch_code_column_long256__fsb32_arrow = \ + col_target_t.col_target_column_long256 + col_source_t.col_source_fsb32_arrow + col_dispatch_code_column_ipv4__u32_arrow = \ + col_target_t.col_target_column_ipv4 + col_source_t.col_source_u32_arrow + col_dispatch_code_column_ipv4__ipv4_pyobj = \ + col_target_t.col_target_column_ipv4 + col_source_t.col_source_ipv4_pyobj + col_dispatch_code_column_ts__datetime_pyobj = \ + col_target_t.col_target_column_ts + col_source_t.col_source_datetime_pyobj + col_dispatch_code_at__datetime_pyobj = \ + col_target_t.col_target_at + col_source_t.col_source_datetime_pyobj + col_dispatch_code_column_binary__bytes_pyobj = \ + col_target_t.col_target_column_binary + col_source_t.col_source_bytes_pyobj + # Int values in order for sorting (as needed for API's sequential coupling). cdef enum meta_target_t: @@ -436,9 +590,14 @@ cdef struct col_setup_t: size_t orig_index Py_buffer pybuf ArrowSchema arrow_schema # Schema of first chunk. 
+ column_sender_arrow_import* arrow_import col_source_t source meta_target_t meta_target col_target_t target + bint large_string_cast_to_utf8 + bint has_override + column_sender_numpy_dtype override_dtype + uint8_t override_geohash_bits cdef struct col_t: @@ -462,6 +621,10 @@ cdef void col_t_release(col_t* col) noexcept: if Py_buffer_obj_is_set(&col.setup.pybuf): PyBuffer_Release(&col.setup.pybuf) # Note: Sets `.pybuf.obj` to NULL. + if col.setup.arrow_import != NULL: + column_sender_arrow_import_free(col.setup.arrow_import) + col.setup.arrow_import = NULL + for chunk_index in range(col.setup.chunks.n_chunks): chunk = &col.setup.chunks.chunks[chunk_index] if chunk.release != NULL: @@ -485,6 +648,44 @@ cdef struct col_t_arr: col_t* d +# Storage for one column's PyObject-sniffed, pre-built typed buffers. +# +# Lifetime is bound to the dataframe_plan_t that owns it. Buffers are +# heap-allocated and freed in `pyobj_built_free`. The columnar emitter +# accesses them via raw pointers (offsets / bytes / data / validity) +# until the chunk flush completes. +cdef struct pyobj_built_t: + # Per-source typed payload. Only one of these is set: + # - str_pyobj : str_offsets + str_bytes + # - int_pyobj : data (int64*) + # - float_pyobj : data (double*) + # - bool_pyobj : data (uint8* — LSB-packed Arrow bitmap, value bits) + void* data + int32_t* str_offsets # NULL except for str_pyobj + uint8_t* str_bytes # NULL except for str_pyobj + size_t str_bytes_len # bytes used (not capacity) + + # Validity bitmap (Arrow LSB-first). NULL when no nulls were seen. + uint8_t* validity + bint has_nulls + + size_t row_count + + +cdef struct dataframe_plan_t: + size_t row_count + size_t col_count + line_sender_table_name c_table_name + int64_t at_value + col_t_arr cols + bint any_cols_need_gil + qdb_pystr_pos str_buf_marker + # Per-column pre-built PyObject buffers, indexed by col_index; + # NULL slot for non-PyObject columns. 
The outer array is NULL until + # `_dataframe_columnar_prebuild_pyobj` runs. + pyobj_built_t** pyobj_built + + cdef col_t_arr col_t_arr_blank() noexcept nogil: cdef col_t_arr arr arr.size = 0 @@ -512,6 +713,53 @@ cdef void col_t_arr_release(col_t_arr* arr) noexcept: arr.d = NULL +cdef dataframe_plan_t dataframe_plan_blank() noexcept nogil: + cdef dataframe_plan_t plan + plan.row_count = 0 + plan.col_count = 0 + plan.c_table_name.buf = NULL + plan.c_table_name.len = 0 + plan.at_value = 0 + plan.cols = col_t_arr_blank() + plan.any_cols_need_gil = False + plan.str_buf_marker.chain = 0 + plan.str_buf_marker.string = 0 + plan.pyobj_built = NULL + return plan + + +cdef void pyobj_built_free(pyobj_built_t* b) noexcept nogil: + if b == NULL: + return + if b.data != NULL: + free(b.data) + if b.str_offsets != NULL: + free(b.str_offsets) + if b.str_bytes != NULL: + free(b.str_bytes) + if b.validity != NULL: + free(b.validity) + free(b) + + +cdef void dataframe_plan_release(dataframe_plan_t* plan) noexcept: + cdef size_t i + if plan.pyobj_built != NULL: + for i in range(plan.col_count): + pyobj_built_free(plan.pyobj_built[i]) + free(plan.pyobj_built) + plan.pyobj_built = NULL + col_t_arr_release(&plan.cols) + plan.row_count = 0 + plan.col_count = 0 + plan.c_table_name.buf = NULL + plan.c_table_name.len = 0 + plan.at_value = 0 + plan.any_cols_need_gil = False + plan.str_buf_marker.chain = 0 + plan.str_buf_marker.string = 0 + + cdef object _NUMPY = None # module object cdef object _NUMPY_BOOL = None cdef object _NUMPY_UINT8 = None @@ -532,20 +780,12 @@ cdef object _PYARROW = None # module object, if available or None cdef int64_t _NAT = INT64_MIN # pandas NaT +cdef bint _dataframe_count_row_path_emissions = False +cdef uint64_t _dataframe_row_path_emissions = 0 -cdef object _dataframe_may_import_deps(): - """" - Lazily import module dependencies on first use to avoid startup overhead. 
- - $ cat imp_test.py - import numpy - import pandas - import pyarrow - $ time python3 ./imp_test.py - python3 ./imp_test.py 0.56s user 1.60s system 852% cpu 0.254 total - """ - global _NUMPY, _PANDAS, _PYARROW, _PANDAS_NA +cdef object _dataframe_may_import_deps(): + global _NUMPY, _PANDAS, _PANDAS_NA global _NUMPY_BOOL global _NUMPY_UINT8 global _NUMPY_INT8 @@ -564,11 +804,10 @@ cdef object _dataframe_may_import_deps(): try: import pandas import numpy - import pyarrow except ImportError as ie: raise ImportError( - 'Missing dependencies: `pandas`, `numpy` and `pyarrow` must all ' + - 'be installed to use the `.dataframe()` method. ' + + 'Missing dependencies: `pandas` and `numpy` must be installed ' + + 'to use the `.dataframe()` method. ' + 'See: https://py-questdb-client.readthedocs.io/' + 'en/latest/installation.html.') from ie _NUMPY = numpy @@ -587,9 +826,29 @@ cdef object _dataframe_may_import_deps(): _NUMPY_OBJECT = type(_NUMPY.dtype('object')) _PANDAS = pandas _PANDAS_NA = pandas.NA + + +cdef object _dataframe_require_pyarrow(): + global _PYARROW + if _PYARROW is not None: + return + try: + import pyarrow + except ImportError as ie: + raise ImportError( + '`pyarrow` is required for this DataFrame path ' + '(ArrowDtype columns, pyarrow Table/RecordBatch sources, ' + 'schema_overrides). Install with `pip install pyarrow`.') from ie _PYARROW = pyarrow +def _debug_dataframe_pyarrow_loaded(): + """Internal: True iff `.dataframe()` has lazily imported pyarrow in + this process. 
Intended for tests that verify a code path stayed + pyarrow-free.""" + return _PYARROW is not None + + cdef object _dataframe_check_is_dataframe(object df): if not isinstance(df, _PANDAS.DataFrame): raise IngressError( @@ -845,18 +1104,32 @@ cdef int _dataframe_classify_timestamp_dtype(object dtype) except -1: 'Raise an issue if you think it should be supported: ' + 'https://github.com/questdb/py-questdb-client/issues.') elif isinstance(dtype, _PANDAS.ArrowDtype): + _dataframe_require_pyarrow() arrow_type = dtype.pyarrow_dtype if arrow_type.id == _PYARROW.lib.Type_TIMESTAMP: if arrow_type.unit == "ns": return col_source_t.col_source_dt64ns_tz_arrow elif arrow_type.unit == "us": return col_source_t.col_source_dt64us_tz_arrow - else: - raise IngressError( - IngressErrorCode.BadDataFrame, - f'Unsupported arrow dtype {dtype} unit {arrow_type.unit}. ' + - 'Raise an issue if you think it should be supported: ' + - 'https://github.com/questdb/py-questdb-client/issues.') + # s / ms fall through: field -> generic Arrow passthrough; + # designated-at -> _dataframe_classify_at_timestamp_dtype. + return 0 + + +cdef int _dataframe_classify_at_timestamp_dtype(object dtype) except -1: + # ms / s designated-`at` Arrow timestamps, widened to micros in Rust by + # the millis/seconds designated-timestamp FFI. Kept out of the shared + # field classifier so timestamp fields still route to the generic Arrow + # passthrough and row-ILP stays untouched. 
+ cdef object arrow_type + if isinstance(dtype, _PANDAS.ArrowDtype): + _dataframe_require_pyarrow() + arrow_type = dtype.pyarrow_dtype + if arrow_type.id == _PYARROW.lib.Type_TIMESTAMP: + if arrow_type.unit == "ms": + return col_source_t.col_source_dt64ms_tz_arrow + elif arrow_type.unit == "s": + return col_source_t.col_source_dt64s_tz_arrow return 0 @@ -865,11 +1138,13 @@ cdef ssize_t _dataframe_resolve_at( col_t_arr* cols, object at, size_t col_count, - int64_t* at_value_out) except -2: + int64_t* at_value_out, + bint columnar) except -2: cdef size_t col_index cdef object dtype cdef PandasCol pandas_col cdef TimestampNanos at_nanos + cdef int at_source if at is None: at_value_out[0] = _AT_IS_SERVER_NOW return -1 @@ -899,10 +1174,21 @@ cdef ssize_t _dataframe_resolve_at( col = &cols.d[col_index] col.setup.meta_target = meta_target_t.meta_target_at return col_index - else: - raise TypeError( - f'Bad argument `at`: Bad dtype `{dtype}` ' + - f'for the {at!r} column: Must be a {_SUPPORTED_DATETIMES} column.') + if columnar: + # ms / s Arrow timestamps resolved to the generic passthrough source + # in `_dataframe_resolve_source_and_buffers`; the buffers are already + # mapped, so override the source to the designated-ts unit and let the + # Rust millis/seconds FFI widen to micros. 
+ at_source = _dataframe_classify_at_timestamp_dtype(dtype) + if at_source != 0: + at_value_out[0] = _AT_IS_SET_BY_COLUMN + col = &cols.d[col_index] + col.setup.source = at_source + col.setup.meta_target = meta_target_t.meta_target_at + return col_index + raise TypeError( + f'Bad argument `at`: Bad dtype `{dtype}` ' + + f'for the {at!r} column: Must be a {_SUPPORTED_DATETIMES} column.') cdef void_int _dataframe_alloc_chunks( @@ -962,6 +1248,7 @@ cdef void_int _dataframe_series_as_pybuf( cdef list _dataframe_series_to_arrow_chunks(PandasCol pandas_col): cdef object array + _dataframe_require_pyarrow() array = _PYARROW.Array.from_pandas(pandas_col.series) if isinstance(array, _PYARROW.ChunkedArray): return array.chunks @@ -998,21 +1285,18 @@ cdef const char* _ARROW_FMT_UTF8_STRING = 'u' cdef const char* _ARROW_FMT_LRG_UTF8_STRING = 'U' +cdef void_int _dataframe_string_series_as_arrow( + PandasCol pandas_col, col_t* col) except -1: + _dataframe_export_arrow_chunks( + _dataframe_series_to_arrow_chunks(pandas_col), + col) + + cdef void_int _dataframe_category_series_as_arrow( PandasCol pandas_col, col_t* col) except -1: cdef const char* format cdef list chunks = _dataframe_series_to_arrow_chunks(pandas_col) - # Pandas 3.x with pyarrow may produce large_string ('U') dictionary - # values. Cast to regular string ('u') so our existing category - # accessors (which use int32 offsets) work unchanged. 
- if (len(chunks) > 0 and - hasattr(chunks[0].type, 'value_type') and - chunks[0].type.value_type == _PYARROW.large_string()): - target_type = _PYARROW.dictionary( - chunks[0].type.index_type, _PYARROW.string()) - chunks = [chunk.cast(target_type) for chunk in chunks] - _dataframe_export_arrow_chunks(chunks, col) format = col.setup.arrow_schema.format @@ -1030,7 +1314,8 @@ cdef void_int _dataframe_category_series_as_arrow( f'Got {(format).decode("utf-8")!r}.') format = col.setup.arrow_schema.dictionary.format - if (strncmp(format, _ARROW_FMT_UTF8_STRING, 1) != 0): + if (strncmp(format, _ARROW_FMT_UTF8_STRING, 1) != 0 and + strncmp(format, _ARROW_FMT_LRG_UTF8_STRING, 1) != 0): raise IngressError( IngressErrorCode.BadDataFrame, f'Bad column {pandas_col.name!r}: ' + @@ -1039,11 +1324,35 @@ cdef void_int _dataframe_category_series_as_arrow( cdef void_int _dataframe_series_resolve_arrow(PandasCol pandas_col, object arrowtype, col_t *col) except -1: cdef bint is_decimal_col = False + _dataframe_require_pyarrow() + if arrowtype.id == _PYARROW.lib.Type_STRING: + _dataframe_string_series_as_arrow(pandas_col, col) + col.setup.source = col_source_t.col_source_str_utf8_arrow + col.scale = 0 + return 0 + elif arrowtype.id == _PYARROW.lib.Type_LARGE_STRING: + _dataframe_string_series_as_arrow(pandas_col, col) + col.setup.source = col_source_t.col_source_str_lrg_utf8_arrow + col.scale = 0 + return 0 + + # Unwrap pyarrow extension types (e.g. `arrow.uuid` wrapping + # `FixedSizeBinary(16)`) to their storage type so dispatch picks + # the storage-shape source. The wire format is identical for both + # forms; pyarrow may or may not have the extension registered at + # runtime, so we accept either input and produce the same source. + # pyarrow exposes no `Type_EXTENSION` constant in all versions we + # support; check via `BaseExtensionType` instead. 
+ if isinstance(arrowtype, _PYARROW.lib.BaseExtensionType): + arrowtype = arrowtype.storage_type + + cdef object t_dec32 = getattr(_PYARROW.lib, 'Type_DECIMAL32', None) + cdef object t_dec64 = getattr(_PYARROW.lib, 'Type_DECIMAL64', None) _dataframe_series_as_arrow(pandas_col, col) - if arrowtype.id == _PYARROW.lib.Type_DECIMAL32: + if t_dec32 is not None and arrowtype.id == t_dec32: col.setup.source = col_source_t.col_source_decimal32_arrow is_decimal_col = True - elif arrowtype.id == _PYARROW.lib.Type_DECIMAL64: + elif t_dec64 is not None and arrowtype.id == t_dec64: col.setup.source = col_source_t.col_source_decimal64_arrow is_decimal_col = True elif arrowtype.id == _PYARROW.lib.Type_DECIMAL128: @@ -1054,8 +1363,6 @@ cdef void_int _dataframe_series_resolve_arrow(PandasCol pandas_col, object arrow is_decimal_col = True elif arrowtype.id == _PYARROW.lib.Type_BOOL: col.setup.source = col_source_t.col_source_bool_arrow - elif arrowtype.id == _PYARROW.lib.Type_LARGE_STRING: - col.setup.source = col_source_t.col_source_str_lrg_utf8_arrow elif arrowtype.id == _PYARROW.lib.Type_FLOAT: col.setup.source = col_source_t.col_source_f32_arrow elif arrowtype.id == _PYARROW.lib.Type_DOUBLE: @@ -1068,12 +1375,18 @@ cdef void_int _dataframe_series_resolve_arrow(PandasCol pandas_col, object arrow col.setup.source = col_source_t.col_source_i32_arrow elif arrowtype.id == _PYARROW.lib.Type_INT64: col.setup.source = col_source_t.col_source_i64_arrow + elif (arrowtype.id == _PYARROW.lib.Type_FIXED_SIZE_BINARY + and arrowtype.byte_width == 16): + col.setup.source = col_source_t.col_source_fsb16_arrow + elif (arrowtype.id == _PYARROW.lib.Type_FIXED_SIZE_BINARY + and arrowtype.byte_width == 32): + col.setup.source = col_source_t.col_source_fsb32_arrow + elif arrowtype.id == _PYARROW.lib.Type_UINT32: + col.setup.source = col_source_t.col_source_u32_arrow else: - raise IngressError( - IngressErrorCode.BadDataFrame, - f'Unsupported arrow type {arrowtype} for column {pandas_col.name!r}. 
' + - 'Raise an issue if you think it should be supported: ' + - 'https://github.com/questdb/py-questdb-client/issues.') + col.setup.source = col_source_t.col_source_arrow_passthrough + col.scale = 0 + return 0 if is_decimal_col: if arrowtype.scale < 0 or arrowtype.scale > 76: raise IngressError( @@ -1145,12 +1458,13 @@ cdef void_int _dataframe_series_sniff_pyobj( 'Unsupported object column containing a numpy array ' + f'of an unsupported element type {arr_type_name}.') elif PyBytes_CheckExact(obj): - raise IngressError( - IngressErrorCode.BadDataFrame, - f'Bad column {pandas_col.name!r}: ' + - 'Unsupported object column containing bytes.' + - 'If this is a string column, decode it first. ' + - 'See: https://stackoverflow.com/questions/40389764/') + col.setup.source = col_source_t.col_source_bytes_pyobj + elif isinstance(obj, _uuid.UUID): + col.setup.source = col_source_t.col_source_uuid_pyobj + elif isinstance(obj, _ipaddress.IPv4Address): + col.setup.source = col_source_t.col_source_ipv4_pyobj + elif isinstance(obj, datetime.datetime): + col.setup.source = col_source_t.col_source_datetime_pyobj elif isinstance(obj, Decimal): col.setup.source = col_source_t.col_source_decimal_pyobj else: @@ -1250,7 +1564,7 @@ cdef void_int _dataframe_resolve_source_and_buffers( _dataframe_series_as_arrow(pandas_col, col) elif isinstance(dtype, _PANDAS.StringDtype): if dtype.storage == 'pyarrow': - _dataframe_series_as_arrow(pandas_col, col) + _dataframe_string_series_as_arrow(pandas_col, col) if strncmp(col.setup.arrow_schema.format, _ARROW_FMT_UTF8_STRING, 1) == 0: col.setup.source = col_source_t.col_source_str_utf8_arrow elif strncmp(col.setup.arrow_schema.format, _ARROW_FMT_LRG_UTF8_STRING, 1) == 0: @@ -1283,13 +1597,13 @@ cdef void_int _dataframe_resolve_source_and_buffers( 'https://github.com/questdb/py-questdb-client/issues.') cdef void_int _dataframe_resolve_target( - PandasCol pandas_col, col_t* col) except -1: + PandasCol pandas_col, col_t* col, tuple field_targets) 
except -1: cdef col_target_t target cdef set target_sources if col.setup.meta_target in _DIRECT_META_TARGETS: col.setup.target = col.setup.meta_target return 0 - for target in _FIELD_TARGETS: + for target in field_targets: target_sources = _TARGET_TO_SOURCES[target] if col.setup.source in target_sources: col.setup.target = target @@ -1305,6 +1619,12 @@ cdef void _dataframe_init_cursor(col_t* col) noexcept nogil: col.cursor.chunk = col.setup.chunks.chunks col.cursor.chunk_index = 0 col.cursor.offset = col.cursor.chunk.offset + col.cursor.dictionary_large_offsets = ( + col.setup.arrow_schema.dictionary != NULL and + strncmp( + col.setup.arrow_schema.dictionary.format, + _ARROW_FMT_LRG_UTF8_STRING, + 1) == 0) cdef void_int _dataframe_resolve_cols( @@ -1343,14 +1663,15 @@ cdef void_int _dataframe_resolve_cols( cdef void_int _dataframe_resolve_cols_target_name_and_dc( qdb_pystr_buf* b, list pandas_cols, - col_t_arr* cols) except -1: + col_t_arr* cols, + tuple field_targets) except -1: cdef size_t index cdef col_t* col cdef PandasCol pandas_col for index in range(cols.size): col = &cols.d[index] pandas_col = pandas_cols[index] - _dataframe_resolve_target(pandas_col, col) + _dataframe_resolve_target(pandas_col, col, field_targets) if col.setup.source not in _TARGET_TO_SOURCES[col.setup.target]: raise ValueError( f'Bad value: Column {pandas_col.name!r} ' + @@ -1387,7 +1708,8 @@ cdef void_int _dataframe_resolve_args( line_sender_table_name* c_table_name_out, int64_t* at_value_out, col_t_arr* cols, - bint* any_cols_need_gil_out) except -1: + bint* any_cols_need_gil_out, + tuple field_targets) except -1: cdef ssize_t name_col cdef ssize_t at_col @@ -1404,12 +1726,117 @@ cdef void_int _dataframe_resolve_args( table_name_col, col_count, c_table_name_out) - at_col = _dataframe_resolve_at(df, cols, at, col_count, at_value_out) + at_col = _dataframe_resolve_at( + df, cols, at, col_count, at_value_out, + field_targets is _FIELD_TARGETS_QWP) _dataframe_resolve_symbols(df, 
pandas_cols, cols, name_col, at_col, symbols) - _dataframe_resolve_cols_target_name_and_dc(b, pandas_cols, cols) + _dataframe_resolve_cols_target_name_and_dc( + b, pandas_cols, cols, field_targets) qsort(cols.d, col_count, sizeof(col_t), _dataframe_compare_cols) +cdef void_int _dataframe_plan_build( + qdb_pystr_buf* b, + object df, + object table_name, + object table_name_col, + object symbols, + object at, + dataframe_plan_t* plan, + tuple field_targets) except -1: + _dataframe_may_import_deps() + _dataframe_check_is_dataframe(df) + plan.row_count = len(df) + if (len(df.columns) == 0) or (plan.row_count == 0): + plan.col_count = 0 + return 0 + + plan.col_count = len(df.columns) + qdb_pystr_buf_clear(b) + plan.cols = col_t_arr_new(plan.col_count) + _dataframe_resolve_args( + df, + table_name, + table_name_col, + symbols, + at if not isinstance(at, ServerTimestampType) else None, + b, + plan.col_count, + &plan.c_table_name, + &plan.at_value, + &plan.cols, + &plan.any_cols_need_gil, + field_targets) + + # Headers and table names stored in `b` are borrowed by the plan. + # Serialization rewinds to this point for every row without dropping + # those borrowed strings. 
+ plan.str_buf_marker = qdb_pystr_buf_tell(b) + + +cdef object _dataframe_plan_debug_str(const char* buf, size_t length): + if buf == NULL: + return None + return PyUnicode_FromStringAndSize(buf, length) + + +def _debug_dataframe_plan( + object df, + *, + object table_name=None, + object table_name_col=None, + object symbols='auto', + object at=None): + cdef qdb_pystr_buf* b = qdb_pystr_buf_new() + cdef dataframe_plan_t plan = dataframe_plan_blank() + cdef size_t col_index + cdef col_t* col + cdef list cols = [] + try: + _dataframe_plan_build( + b, + df, + table_name, + table_name_col, + symbols, + at, + &plan, + _FIELD_TARGETS_ROW) + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + cols.append({ + 'orig_index': col.setup.orig_index, + 'orig_name': df.columns[col.setup.orig_index], + 'target': _TARGET_NAMES[col.setup.target], + 'target_name': _dataframe_plan_debug_str( + col.name.buf, + col.name.len), + 'source_code': col.setup.source, + 'dispatch_code': col.dispatch_code, + 'large_string_cast_to_utf8': bool( + col.setup.large_string_cast_to_utf8), + }) + if plan.at_value == _AT_IS_SERVER_NOW: + at_value = 'server_now' + elif plan.at_value == _AT_IS_SET_BY_COLUMN: + at_value = 'column' + else: + at_value = plan.at_value + return { + 'row_count': plan.row_count, + 'col_count': plan.col_count, + 'fixed_table_name': _dataframe_plan_debug_str( + plan.c_table_name.buf, + plan.c_table_name.len), + 'at_value': at_value, + 'any_cols_need_gil': bool(plan.any_cols_need_gil), + 'cols': cols, + } + finally: + dataframe_plan_release(&plan) + qdb_pystr_buf_free(b) + + cdef inline bint _dataframe_arrow_get_bool(col_cursor_t* cursor) noexcept nogil: return ( (cursor.chunk.buffers[1])[cursor.offset // 8] & @@ -1434,13 +1861,23 @@ cdef inline void _dataframe_arrow_get_cat_value( size_t* len_out, const char** buf_out) noexcept nogil: cdef int32_t* value_index_access + cdef int64_t* value_lrg_index_access cdef int32_t value_begin + cdef int64_t value_lrg_begin 
cdef uint8_t* value_char_access - value_index_access = cursor.chunk.dictionary.buffers[1] - value_begin = value_index_access[key] - len_out[0] = value_index_access[key + 1] - value_begin value_char_access = cursor.chunk.dictionary.buffers[2] - buf_out[0] = &value_char_access[value_begin] + if cursor.dictionary_large_offsets: + value_lrg_index_access = cursor.chunk.dictionary.buffers[1] + value_lrg_begin = value_lrg_index_access[key] + len_out[0] = ( + value_lrg_index_access[key + 1] - value_lrg_begin) + buf_out[0] = &value_char_access[ + value_lrg_begin] + else: + value_index_access = cursor.chunk.dictionary.buffers[1] + value_begin = value_index_access[key] + len_out[0] = value_index_access[key + 1] - value_begin + buf_out[0] = &value_char_access[value_begin] cdef inline bint _dataframe_arrow_get_cat_i8( @@ -2254,6 +2691,44 @@ cdef void_int _dataframe_serialize_cell_column_ts__dt64us_numpy( raise c_err_to_py(err) +cdef void_int _dataframe_serialize_cell_column_ts__datetime_pyobj( + line_sender_buffer* ls_buf, + qdb_pystr_buf* b, + col_t* col) except -1: + cdef line_sender_error* err = NULL + cdef PyObject** access = col.cursor.chunk.buffers[1] + cdef PyObject* cell = access[col.cursor.offset] + cdef object dt + cdef object delta + cdef int64_t micros + if _dataframe_is_null_pyobj(cell): + return 0 + if not isinstance(cell, cp_datetime): + raise ValueError( + 'Expected an object of type datetime, got an object of type ' + + _fqn(type(cell)) + '.') + dt = cell + if dt.tzinfo is None: + micros = ( + _days_from_civil( + PyDateTime_GET_YEAR(dt), + PyDateTime_GET_MONTH(dt), + PyDateTime_GET_DAY(dt)) * 86_400_000_000 + + PyDateTime_DATE_GET_HOUR(dt) * 3_600_000_000 + + PyDateTime_DATE_GET_MINUTE(dt) * 60_000_000 + + PyDateTime_DATE_GET_SECOND(dt) * 1_000_000 + + PyDateTime_DATE_GET_MICROSECOND(dt)) + else: + delta = dt - datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc) + micros = ( + delta.days * 86_400_000_000 + + delta.seconds * 1_000_000 + + 
delta.microseconds) + if not line_sender_buffer_column_ts_micros(ls_buf, col.name, micros, &err): + raise c_err_to_py(err) + return 0 + + cdef void_int _dataframe_serialize_cell_column_arr_f64__arr_f64_numpyobj( line_sender_buffer* ls_buf, qdb_pystr_buf* b, @@ -2516,6 +2991,9 @@ cdef void_int _dataframe_serialize_cell( col_t* col, PyThreadState** gs) except -1: cdef col_dispatch_code_t dc = col.dispatch_code + global _dataframe_row_path_emissions + if _dataframe_count_row_path_emissions: + _dataframe_row_path_emissions += 1 # Note!: Code below will generate a `switch` statement. # Ensure this happens! Don't break the `dc == ...` pattern. if dc == col_dispatch_code_t.col_dispatch_code_skip_nulls: @@ -2610,6 +3088,8 @@ cdef void_int _dataframe_serialize_cell( _dataframe_serialize_cell_column_ts__dt64ns_numpy(ls_buf, b, col, gs) elif dc == col_dispatch_code_t.col_dispatch_code_column_ts__dt64us_numpy: _dataframe_serialize_cell_column_ts__dt64us_numpy(ls_buf, b, col, gs) + elif dc == col_dispatch_code_t.col_dispatch_code_column_ts__datetime_pyobj: + _dataframe_serialize_cell_column_ts__datetime_pyobj(ls_buf, b, col) elif dc == col_dispatch_code_t.col_dispatch_code_column_arr_f64__arr_f64_numpyobj: _dataframe_serialize_cell_column_arr_f64__arr_f64_numpyobj(ls_buf, b, col) elif dc == col_dispatch_code_t.col_dispatch_code_column_decimal__decimal_pyobj: @@ -2722,13 +3202,7 @@ cdef void_int _dataframe( object table_name_col, object symbols, object at) except -1: - cdef size_t col_count - cdef line_sender_table_name c_table_name - cdef int64_t at_value = _AT_IS_SET_BY_COLUMN - cdef col_t_arr cols = col_t_arr_blank() - cdef bint any_cols_need_gil = False - cdef qdb_pystr_pos str_buf_marker - cdef size_t row_count + cdef dataframe_plan_t plan = dataframe_plan_blank() cdef line_sender_error* err = NULL cdef size_t row_index cdef size_t col_index @@ -2737,51 +3211,38 @@ cdef void_int _dataframe( cdef PyThreadState* gs = NULL # GIL state. NULL means we have the GIL. 
cdef bint had_gil cdef bint was_serializing_cell = False - - _dataframe_may_import_deps() - _dataframe_check_is_dataframe(df) - row_count = len(df) - col_count = len(df.columns) - if (col_count == 0) or (row_count == 0): - return 0 # Nothing to do. + cdef bint was_auto_flush = False + cdef bint plan_has_content try: - qdb_pystr_buf_clear(b) - cols = col_t_arr_new(col_count) - _dataframe_resolve_args( + _dataframe_plan_build( + b, df, table_name, table_name_col, symbols, - at if not isinstance(at, ServerTimestampType) else None, - b, - col_count, - &c_table_name, - &at_value, - &cols, - &any_cols_need_gil) - - # We've used the str buffer up to a point for the headers. - # Instead of clearing it (which would clear the headers' memory) - # we will truncate (rewind) back to this position. - str_buf_marker = qdb_pystr_buf_tell(b) + at, + &plan, + _FIELD_TARGETS_ROW) + if (plan.col_count == 0) or (plan.row_count == 0): + return 0 # Nothing to do. line_sender_buffer_clear_marker(ls_buf) # On error, undo all added lines. if not line_sender_buffer_set_marker(ls_buf, &err): raise c_err_to_py(err) - row_gil_blip_interval = _CELL_GIL_BLIP_INTERVAL // col_count + row_gil_blip_interval = _CELL_GIL_BLIP_INTERVAL // plan.col_count if row_gil_blip_interval < 400: # ceiling reached at 100 columns row_gil_blip_interval = 400 try: # Don't move this logic up! We need the GIL to execute a `try`. # Also we can't have any other `try` blocks between here and the # `finally` block. - if not any_cols_need_gil: + if not plan.any_cols_need_gil: _ensure_doesnt_have_gil(&gs) - for row_index in range(row_count): + for row_index in range(plan.row_count): if (gs == NULL) and (row_index % row_gil_blip_interval == 0): # Release and re-acquire the GIL every so often. # This is to allow other python threads to run. 
@@ -2790,30 +3251,30 @@ cdef void_int _dataframe( _ensure_doesnt_have_gil(&gs) _ensure_has_gil(&gs) - qdb_pystr_buf_truncate(b, str_buf_marker) + qdb_pystr_buf_truncate(b, plan.str_buf_marker) # Table-name from `table_name` arg in Python. - if c_table_name.buf != NULL: - if not line_sender_buffer_table(ls_buf, c_table_name, &err): + if plan.c_table_name.buf != NULL: + if not line_sender_buffer_table(ls_buf, plan.c_table_name, &err): _ensure_has_gil(&gs) raise c_err_to_py(err) # Serialize columns cells. # Note: Columns are sorted: table name, symbols, fields, at. was_serializing_cell = True - for col_index in range(col_count): - col = &cols.d[col_index] + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] _dataframe_serialize_cell(ls_buf, b, col, &gs) # may raise _dataframe_col_advance(col) was_serializing_cell = False # Fixed "at" value (not from a column). - if at_value == _AT_IS_SERVER_NOW: + if plan.at_value == _AT_IS_SERVER_NOW: if not line_sender_buffer_at_now(ls_buf, &err): _ensure_has_gil(&gs) raise c_err_to_py(err) - elif at_value >= 0: - if not line_sender_buffer_at_nanos(ls_buf, at_value, &err): + elif plan.at_value >= 0: + if not line_sender_buffer_at_nanos(ls_buf, plan.at_value, &err): _ensure_has_gil(&gs) raise c_err_to_py(err) @@ -2853,6 +3314,9 @@ cdef void_int _dataframe( raise finally: _ensure_has_gil(&gs) # Note: We need the GIL for cleanup. - line_sender_buffer_clear_marker(ls_buf) - col_t_arr_release(&cols) - qdb_pystr_buf_clear(b) + plan_has_content = (plan.col_count != 0) and (plan.row_count != 0) + if plan_has_content: + line_sender_buffer_clear_marker(ls_buf) + dataframe_plan_release(&plan) + if plan_has_content: + qdb_pystr_buf_clear(b) diff --git a/src/questdb/egress.pxi b/src/questdb/egress.pxi new file mode 100644 index 00000000..3f8fc049 --- /dev/null +++ b/src/questdb/egress.pxi @@ -0,0 +1,2128 @@ +# Egress (QWP/WebSocket reader) Cython glue. 
+# +# `QueryResult` exposes the Arrow PyCapsule Interface +# (`__arrow_c_stream__`) directly off the Rust cursor, so polars / +# duckdb / pandas 3.0 / any Arrow-native consumer can read query +# results without pyarrow. `to_arrow`, `to_pandas`, `iter_arrow`, +# `iter_pandas` are convenience wrappers that lazy-import pyarrow. + + +cdef inline object _reader_err_code_to_py(reader_error_code code): + if code == reader_error_could_not_resolve_addr: + return IngressErrorCode.CouldNotResolveAddr + if code == reader_error_config_error: + return IngressErrorCode.ConfigError + if code == reader_error_invalid_api_call: + return IngressErrorCode.InvalidApiCall + if code == reader_error_socket_error: + return IngressErrorCode.SocketError + if code == reader_error_tls_error: + return IngressErrorCode.TlsError + if code == reader_error_auth_error: + return IngressErrorCode.AuthError + if code == reader_error_invalid_utf8: + return IngressErrorCode.InvalidUtf8 + if code == reader_error_cancelled: + return IngressErrorCode.Cancelled + if code == reader_error_failover_would_duplicate: + return IngressErrorCode.FailoverWouldDuplicate + if code == reader_error_role_mismatch: + return IngressErrorCode.RoleMismatch + # Map every other reader-specific code (handshake, protocol, invalid + # bind, schema drift, no schema, server-side errors, etc.) to + # ServerFlushError as a broad bucket. Refine later as users surface + # concrete distinctions. 
+ return IngressErrorCode.ServerFlushError + + +cdef inline object _reader_err_to_py(reader_error* err): + """Construct an ``IngressError`` from a ``reader_error*`` and free it.""" + cdef reader_error_code code = reader_error_get_code(err) + cdef size_t c_len = 0 + cdef const char* c_msg = reader_error_msg(err, &c_len) + cdef object py_code + cdef object py_msg + try: + py_code = _reader_err_code_to_py(code) + py_msg = PyUnicode_FromStringAndSize(c_msg, c_len) + return IngressError(py_code, py_msg) + finally: + reader_error_free(err) + + +cdef class _ReaderHandle: + """Owns a ``reader*``. + + On dealloc the reader either returns to its pool or is dropped, + depending on the ``reader``'s own ownership tag (set when it + was constructed — see ``ReaderOwnership`` in the Rust FFI): + + - Pool-borrowed readers go back to the pool unless + ``_must_close`` was set, in which case the pool drops them. + - Standalone readers (from ``reader_from_conf``) are always + dropped. + + The Python side carries only one extra bit of state — + ``_must_close`` — which it forwards to the FFI via + ``reader_mark_must_close`` before calling close. We never + hold a raw ``questdb_db*`` pointer here: the reader struct + holds an ``Arc`` internally, so the pool stays alive + even if the user's ``Client.close()`` ran after ``query()`` + returned but before the reader dealloced. + + ``_must_close`` defaults to ``True``: only the generator's + clean-drain path (or code that explicitly knows the cursor + reached terminal) clears it. Any error path or abandon-without- + consume path forces the reader to drop, since the Rust + Cursor::Drop closes the transport whenever ``cursor_active`` is + still set at drop time — recycling such a reader would hand the + next borrower a broken pipe. 
+ """ + cdef reader* _reader + cdef bint _must_close + + def __cinit__(self): + self._reader = NULL + self._must_close = True + + cdef _attach(self, reader* reader): + self._reader = reader + + cdef void _close(self) noexcept: + cdef PyThreadState* gs = NULL + if self._reader == NULL: + return + if self._must_close: + reader_mark_must_close(self._reader) + _ensure_doesnt_have_gil(&gs) + reader_close(self._reader) + _ensure_has_gil(&gs) + self._reader = NULL + + def __dealloc__(self): + self._close() + + +cdef class _CursorHandle: + """Owns a ``reader_cursor*`` + back-ref to its reader. Freed on dealloc. + + ``_reset_seq`` counts mid-query failover resets. The + ``_failover_reset_trampoline`` installed on the materialise-whole + query path bumps it (a plain C-field write, no GIL, no FFI, no + exception — honouring the reader's reentrancy contract) when the + cursor re-executes the query on a new endpoint. The accumulating + reader (``_numpy_frame_from_cursor`` etc.) compares it against the + sequence it observed at start-of-stream and, when it advanced, + discards every batch buffered so far so the replay-from-batch-0 + yields a correct whole result. + """ + cdef reader_cursor* _cursor + cdef _ReaderHandle _reader_ref + cdef object _lock + cdef int _reset_seq + + def __cinit__(self): + self._cursor = NULL + self._reader_ref = None + self._lock = threading.Lock() + self._reset_seq = 0 + + cdef _attach(self, reader_cursor* cursor, _ReaderHandle reader_ref): + self._cursor = cursor + self._reader_ref = reader_ref + + cdef void _free(self) noexcept: + cdef PyThreadState* gs = NULL + with self._lock: + if self._cursor != NULL: + _ensure_doesnt_have_gil(&gs) + reader_cursor_free(self._cursor) + _ensure_has_gil(&gs) + self._cursor = NULL + + def __dealloc__(self): + self._free() + + +cdef object _fetch_one_batch(_CursorHandle handle, object pa_module): + """Pull one batch via reader_cursor_next_arrow_batch. + + Returns: + - None on clean end-of-stream. 
+ - A pyarrow.RecordBatch on success. + Raises IngressError on FFI error. + """ + cdef ArrowArray array + cdef ArrowSchema schema + cdef reader_error* err = NULL + cdef reader_arrow_batch_result result + cdef reader_cursor* cursor + + with handle._lock: + cursor = handle._cursor + if cursor == NULL: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'cursor is closed') + with nogil: + result = reader_cursor_next_arrow_batch( + cursor, &array, &schema, &err) + + if result == reader_arrow_batch_ok: + # Hand ownership of the array + schema buffers to pyarrow. + # _import_from_c moves the structs and nulls their release + # callbacks; pyarrow's RecordBatch owns the buffers from here. + try: + return pa_module.RecordBatch._import_from_c( + &array, &schema) + except: + if array.release != NULL: + array.release(&array) + if schema.release != NULL: + schema.release(&schema) + raise + + if result == reader_arrow_batch_end: + return None + + # Error path. + if err == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'reader_cursor_next_arrow_batch returned error ' + 'without setting err_out') + raise _reader_err_to_py(err) + + +cdef tuple _fetch_all_record_batches(_CursorHandle handle, object pa_module): + """Drain the cursor into a list of ``pyarrow.RecordBatch`` we own. + + The materialise-whole entry points install the failover-reset + trampoline, which bumps ``handle._reset_seq`` when a mid-query + failover re-executes the query on a new endpoint. Because we own the + accumulator here, on a reset we discard every batch buffered so far + and restart from the replayed batch-0 — yielding a correct, + duplicate-free whole result. Returns ``(schema_or_None, batches)``; + the cursor is freed and the reader marked drained on clean + end-of-stream. 
+ """ + cdef int seen_seq = handle._reset_seq + cdef object schema = None + cdef list batches = [] + cdef object batch + try: + while True: + batch = _fetch_one_batch(handle, pa_module) + if handle._reset_seq != seen_seq: + # Mid-query failover replayed from batch-0: drop the + # pre-failover accumulation and re-pin the schema. + seen_seq = handle._reset_seq + batches = [] + schema = None + if batch is None: + break + if schema is None: + schema = batch.schema + batches.append(batch) + except: + handle._free() + raise + _mark_reader_drained(handle) + handle._free() + return (schema, batches) + + +cdef object _build_record_batch_reader(_CursorHandle cursor_handle): + """Construct a pyarrow.RecordBatchReader over the cursor. + + Peeks the first batch to capture the stream schema, then yields + the remaining batches lazily. The cursor is explicitly freed when + the underlying generator completes (exhaustion, exception, or + close), so the owning reader can be closed without leaking a live + cursor. + """ + import pyarrow as pa + + cdef int seen_seq = cursor_handle._reset_seq + first = _fetch_one_batch(cursor_handle, pa) + if first is None: + # Empty result: cursor already reached terminal cleanly. + # Safe to return the reader to its pool. + _mark_reader_drained(cursor_handle) + cursor_handle._free() + empty = pa.table({}) + return empty.to_reader() + + schema = first.schema + + def _gen(): + try: + yield first + while True: + nxt = _fetch_one_batch(cursor_handle, pa) + if cursor_handle._reset_seq != seen_seq: + # Mid-query failover after batches were already + # yielded: the replayed batch-0 would duplicate what + # the consumer holds. Streaming can't discard it, so + # surface a clean, catchable error. + raise IngressError( + IngressErrorCode.FailoverWouldDuplicate, + 'mid-query failover would duplicate already-' + 'delivered batches; re-issue the query') + if nxt is None: + # Reached terminal cleanly; reader is reusable. 
+ _mark_reader_drained(cursor_handle) + return + yield nxt + finally: + cursor_handle._free() + + return pa.RecordBatchReader.from_batches(schema, _gen()) + + +cdef void _mark_reader_drained(_CursorHandle cursor_handle) noexcept: + """Tell the reader handle it's safe to return to its pool on dealloc. + + The Rust Cursor::Drop closes the underlying transport whenever + ``cursor_active`` is still set. Only call this once the cursor has + reached its terminal frame (``_end``) — otherwise the next pool + borrower would see a broken pipe. + """ + if cursor_handle is None: + return + cdef _ReaderHandle reader = cursor_handle._reader_ref + if reader is not None: + reader._must_close = False + + +cdef _ReaderHandle _borrow_reader_from_pool(questdb_db* db): + """Borrow a reader from the Rust-side ``questdb_db`` pool. + + Wraps ``questdb_db_borrow_reader`` and packs the result into a + :class:`_ReaderHandle` that knows it came from this pool, so + its dealloc returns/drops via the matching FFI. + """ + cdef reader_error* err = NULL + cdef reader* reader = NULL + with nogil: + reader = questdb_db_borrow_reader(db, &err) + if reader == NULL: + if err == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'questdb_db_borrow_reader returned NULL without setting err') + raise _reader_err_to_py(err) + cdef _ReaderHandle handle = _ReaderHandle() + handle._attach(reader) + return handle + + +cdef void _failover_reset_trampoline( + const reader_failover_event* event, + void* user_data) noexcept nogil: + # Fires synchronously inside reader_cursor_next_batch while the + # reader re-executes on a new endpoint, before the replayed batch-0 + # arrives. Honour the C reentrancy contract: no reentrant FFI on the + # reader/query/cursor, no exception escapes, non-blocking. 
user_data is + # a raw int* at the cursor's _reset_seq counter; bumping it is a plain + # pointer write (no GIL, no Python object touched), which the + # materialise-whole accumulator polls to discard its pre-failover batches. + if user_data == NULL: + return + (<int*>user_data)[0] += 1 + + +cdef _CursorHandle _execute_query(_ReaderHandle reader_handle, str sql): + """Execute a SQL query and return a _CursorHandle. + + The query is prepared with an ``on_failover_reset`` trampoline that + bumps the cursor's ``_reset_seq`` on a mid-query failover. The + materialise-whole entry points poll it to discard their partial + accumulation and replay-from-batch-0 transparently; the streaming + entry points poll it to surface a clean ``FailoverWouldDuplicate`` + (the already-yielded batches can't be discarded). Installing the + callback also clears the C-side silent-duplicate guard, so a + post-delivery failover re-executes rather than aborting outright. + """ + cdef bytes sql_bytes = sql.encode('utf-8') + cdef line_sender_error* utf8_err = NULL + cdef line_sender_utf8 sql_utf8 + cdef reader_error* err = NULL + cdef reader_query* query + cdef reader_cursor* cursor + + if not line_sender_utf8_init( + &sql_utf8, + len(sql_bytes), + sql_bytes, + &utf8_err): + raise c_err_to_py(utf8_err) + + cdef _CursorHandle handle = _CursorHandle() + + with nogil: + query = reader_prepare(reader_handle._reader, sql_utf8, &err) + + if query == NULL: + if err == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'reader_prepare returned NULL without setting err') + raise _reader_err_to_py(err) + + reader_query_on_failover_reset( + query, _failover_reset_trampoline, &handle._reset_seq) + + with nogil: + cursor = reader_query_execute(&query, &err) + + if cursor == NULL: + # _query_execute consumes the query (nulls *query_inout); the + # defensive free is a no-op on the consumed handle.
+ reader_query_free(query) + if err == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'reader_query_execute returned NULL without setting err') + raise _reader_err_to_py(err) + + handle._attach(cursor, reader_handle) + return handle + + +cdef size_t _arrow_metadata_byte_len(const char* md) noexcept: + cdef int32_t n + cdef int32_t klen + cdef int32_t vlen + cdef size_t pos + cdef int32_t i + memcpy(&n, md, sizeof(int32_t)) + pos = sizeof(int32_t) + if n <= 0: + return pos + for i in range(n): + memcpy(&klen, md + pos, sizeof(int32_t)) + if klen < 0: + return pos + pos += sizeof(int32_t) + klen + memcpy(&vlen, md + pos, sizeof(int32_t)) + if vlen < 0: + return pos + pos += sizeof(int32_t) + vlen + return pos + + +cdef void _arrow_schema_clone_release(ArrowSchema* schema) noexcept: + cdef int64_t i + if schema.format != NULL: + free(schema.format) + schema.format = NULL + if schema.name != NULL: + free(schema.name) + schema.name = NULL + if schema.metadata != NULL: + free(schema.metadata) + schema.metadata = NULL + if schema.children != NULL: + for i in range(schema.n_children): + if schema.children[i] != NULL: + if schema.children[i].release != NULL: + schema.children[i].release(schema.children[i]) + free(schema.children[i]) + free(schema.children) + schema.children = NULL + if schema.dictionary != NULL: + if schema.dictionary.release != NULL: + schema.dictionary.release(schema.dictionary) + free(schema.dictionary) + schema.dictionary = NULL + schema.release = NULL + + +cdef int _arrow_schema_deep_clone(const ArrowSchema* src, ArrowSchema* dst) noexcept: + cdef size_t format_len + cdef size_t name_len + cdef size_t metadata_len + cdef int64_t i + cdef ArrowSchema* child + memset(dst, 0, sizeof(ArrowSchema)) + dst.flags = src.flags + dst.n_children = src.n_children + if src.format != NULL: + format_len = strlen(src.format) + dst.format = malloc(format_len + 1) + if dst.format == NULL: + _arrow_schema_clone_release(dst) + return -1 + 
memcpy(dst.format, src.format, format_len + 1) + if src.name != NULL: + name_len = strlen(src.name) + dst.name = malloc(name_len + 1) + if dst.name == NULL: + _arrow_schema_clone_release(dst) + return -1 + memcpy(dst.name, src.name, name_len + 1) + if src.metadata != NULL: + metadata_len = _arrow_metadata_byte_len(src.metadata) + dst.metadata = malloc(metadata_len) + if dst.metadata == NULL: + _arrow_schema_clone_release(dst) + return -1 + memcpy(dst.metadata, src.metadata, metadata_len) + if src.n_children > 0: + dst.children = calloc( + src.n_children, sizeof(ArrowSchema*)) + if dst.children == NULL: + _arrow_schema_clone_release(dst) + return -1 + for i in range(src.n_children): + child = malloc(sizeof(ArrowSchema)) + if child == NULL: + _arrow_schema_clone_release(dst) + return -1 + dst.children[i] = child + if _arrow_schema_deep_clone(src.children[i], child) != 0: + _arrow_schema_clone_release(dst) + return -1 + if src.dictionary != NULL: + dst.dictionary = malloc(sizeof(ArrowSchema)) + if dst.dictionary == NULL: + _arrow_schema_clone_release(dst) + return -1 + if _arrow_schema_deep_clone(src.dictionary, dst.dictionary) != 0: + _arrow_schema_clone_release(dst) + return -1 + dst.release = _arrow_schema_clone_release + return 0 + + +cdef class _QueryStreamProducer: + """Holder for the Rust-cursor-backed ArrowArrayStream. + + The Arrow stream struct itself is owned by the enclosing PyCapsule; + this object owns just the cached `(schema, array)` and the + `_CursorHandle` keep-alive. Refcount is bumped on capsule creation + and dropped by the stream's release callback so the consumer's + capsule lifetime governs everything downstream. 
+ """ + cdef _CursorHandle cursor_handle + cdef ArrowSchema cached_schema + cdef ArrowArray cached_array + cdef bint has_cached_schema + cdef bint has_cached_array + cdef bint exhausted + cdef char* last_error + cdef int seen_seq + + def __cinit__(self): + self.cursor_handle = None + self.has_cached_schema = False + self.has_cached_array = False + self.exhausted = False + self.last_error = NULL + self.seen_seq = 0 + memset(&self.cached_schema, 0, sizeof(ArrowSchema)) + memset(&self.cached_array, 0, sizeof(ArrowArray)) + + cdef void _free_cached(self) noexcept: + if self.has_cached_schema: + if self.cached_schema.release != NULL: + self.cached_schema.release(&self.cached_schema) + self.has_cached_schema = False + if self.has_cached_array: + if self.cached_array.release != NULL: + self.cached_array.release(&self.cached_array) + self.has_cached_array = False + + def __dealloc__(self): + self._free_cached() + if self.last_error != NULL: + free(self.last_error) + self.last_error = NULL + + +cdef void _qs_set_error(_QueryStreamProducer prod, const char* msg, size_t msg_len) noexcept: + if prod.last_error != NULL: + free(prod.last_error) + prod.last_error = NULL + prod.last_error = malloc(msg_len + 1) + if prod.last_error == NULL: + return + memcpy(prod.last_error, msg, msg_len) + prod.last_error[msg_len] = 0 + + +cdef int _qs_pull(_QueryStreamProducer prod) noexcept with gil: + cdef reader_cursor* cursor + cdef ArrowArray local_array + cdef ArrowSchema local_schema + cdef reader_error* err = NULL + cdef reader_arrow_batch_result result + cdef const char* err_msg = NULL + cdef size_t err_len = 0 + cdef reader_error_code code + cdef object py_msg + cdef bytes full + if prod.exhausted: + return 0 + if prod.cursor_handle is None: + _qs_set_error(prod, b'cursor is closed', 16) + prod.exhausted = True + return -1 + memset(&local_array, 0, sizeof(ArrowArray)) + memset(&local_schema, 0, sizeof(ArrowSchema)) + with prod.cursor_handle._lock: + cursor = prod.cursor_handle._cursor + 
if cursor == NULL: + _qs_set_error(prod, b'cursor is closed', 16) + prod.exhausted = True + return -1 + with nogil: + result = reader_cursor_next_arrow_batch( + cursor, &local_array, &local_schema, &err) + if result == reader_arrow_batch_ok: + if prod.cursor_handle._reset_seq != prod.seen_seq: + # Mid-query failover replayed from batch-0 after batches were + # already handed to the consumer; this one would duplicate + # them. Streaming can't discard it — surface a clean error + # (tagged like other capsule errors) and stop. + if local_array.release != NULL: + local_array.release(&local_array) + if local_schema.release != NULL: + local_schema.release(&local_schema) + full = ( + '[' + IngressErrorCode.FailoverWouldDuplicate.name + '] ' + 'mid-query failover would duplicate already-delivered ' + 'batches; re-issue the query').encode('utf-8') + _qs_set_error(prod, full, len(full)) + prod.exhausted = True + return -1 + if not prod.has_cached_schema: + memcpy(&prod.cached_schema, &local_schema, sizeof(ArrowSchema)) + prod.has_cached_schema = True + else: + if local_schema.release != NULL: + local_schema.release(&local_schema) + if prod.has_cached_array and prod.cached_array.release != NULL: + prod.cached_array.release(&prod.cached_array) + memcpy(&prod.cached_array, &local_array, sizeof(ArrowArray)) + prod.has_cached_array = True + return 0 + if result == reader_arrow_batch_end: + prod.exhausted = True + if prod.cursor_handle._reader_ref is not None: + prod.cursor_handle._reader_ref._must_close = False + return 0 + if err != NULL: + code = reader_error_get_code(err) + err_msg = reader_error_msg(err, &err_len) + try: + if err_msg != NULL: + py_msg = PyUnicode_FromStringAndSize(err_msg, err_len) + else: + py_msg = 'arrow batch fetch failed' + full = ( + '[' + _reader_err_code_to_py(code).name + '] ' + py_msg + ).encode('utf-8') + _qs_set_error(prod, full, len(full)) + except: + if err_msg != NULL: + _qs_set_error(prod, err_msg, err_len) + else: + _qs_set_error(prod, 
b'arrow batch fetch failed', 24) + reader_error_free(err) + else: + _qs_set_error( + prod, + b'arrow batch fetch error without err_out', 39) + prod.exhausted = True + return -1 + + +cdef int _qs_get_schema(ArrowArrayStream* stream, ArrowSchema* out) noexcept with gil: + cdef _QueryStreamProducer prod + if stream == NULL or stream.private_data == NULL: + return 22 # EINVAL + prod = <_QueryStreamProducer>stream.private_data + if not prod.has_cached_schema: + if _qs_pull(prod) != 0: + return 5 # EIO + if not prod.has_cached_schema: + if _qs_install_empty_struct_schema(prod) != 0: + return 12 # ENOMEM + if _arrow_schema_deep_clone(&prod.cached_schema, out) != 0: + _qs_set_error(prod, b'failed to clone ArrowSchema', 27) + return 12 # ENOMEM + return 0 + + +cdef int _qs_install_empty_struct_schema(_QueryStreamProducer prod) noexcept: + """For an empty result set, fabricate a zero-column struct schema + so consumers (polars / pyarrow) iterate to a clean end-of-stream + instead of erroring on missing schema.""" + cdef char* fmt = malloc(3) + if fmt == NULL: + return -1 + fmt[0] = b'+' + fmt[1] = b's' + fmt[2] = 0 + memset(&prod.cached_schema, 0, sizeof(ArrowSchema)) + prod.cached_schema.format = fmt + prod.cached_schema.release = _arrow_schema_clone_release + prod.has_cached_schema = True + return 0 + + +cdef int _qs_get_next(ArrowArrayStream* stream, ArrowArray* out) noexcept with gil: + cdef _QueryStreamProducer prod + memset(out, 0, sizeof(ArrowArray)) + if stream == NULL or stream.private_data == NULL: + return 22 # EINVAL + prod = <_QueryStreamProducer>stream.private_data + if not prod.has_cached_array: + if _qs_pull(prod) != 0: + return 5 # EIO + if prod.has_cached_array: + memcpy(out, &prod.cached_array, sizeof(ArrowArray)) + memset(&prod.cached_array, 0, sizeof(ArrowArray)) + prod.has_cached_array = False + return 0 + return 0 + + +cdef const char* _qs_get_last_error(ArrowArrayStream* stream) noexcept with gil: + cdef _QueryStreamProducer prod + if stream == NULL 
or stream.private_data == NULL: + return NULL + prod = <_QueryStreamProducer>stream.private_data + return prod.last_error + + +cdef void _qs_release(ArrowArrayStream* stream) noexcept with gil: + cdef _QueryStreamProducer prod + if stream == NULL or stream.private_data == NULL: + return + prod = <_QueryStreamProducer>stream.private_data + stream.private_data = NULL + stream.release = NULL + Py_DECREF(prod) + + +cdef void _qs_capsule_destructor(object capsule) noexcept: + cdef ArrowArrayStream* stream + if not PyCapsule_IsValid(capsule, b'arrow_array_stream'): + return + stream = PyCapsule_GetPointer( + capsule, b'arrow_array_stream') + if stream == NULL: + return + if stream.release != NULL: + stream.release(stream) + free(stream) + + +cdef object _make_query_stream_capsule(_CursorHandle handle): + cdef _QueryStreamProducer prod + cdef ArrowArrayStream* stream + prod = _QueryStreamProducer() + prod.cursor_handle = handle + prod.seen_seq = handle._reset_seq + stream = calloc(1, sizeof(ArrowArrayStream)) + if stream == NULL: + raise MemoryError() + stream.get_schema = _qs_get_schema + stream.get_next = _qs_get_next + stream.get_last_error = _qs_get_last_error + stream.release = _qs_release + Py_INCREF(prod) + stream.private_data = prod + try: + return PyCapsule_New( + stream, b'arrow_array_stream', _qs_capsule_destructor) + except: + Py_DECREF(prod) + free(stream) + raise + + +_NUMPY_NULLABLE_CACHE = None + + +cdef object _numpy_nullable_mapping(): + """Return a ``types_mapper`` callable that maps Arrow primitives to + pandas nullable-extension dtypes (Int64Dtype, Float64Dtype, etc.). + + Mirrors ``pandas.io._util._arrow_dtype_mapping``'s coverage so that + ``to_pandas(dtype_backend="numpy_nullable")`` here matches what + ``pd.read_parquet(..., dtype_backend="numpy_nullable")`` produces. + Non-primitive Arrow types fall through (mapper returns None) and + pyarrow.Table.to_pandas applies its default conversion. 
+ """ + global _NUMPY_NULLABLE_CACHE + if _NUMPY_NULLABLE_CACHE is None: + import pyarrow as pa + import pandas as pd + _NUMPY_NULLABLE_CACHE = { + pa.int8(): pd.Int8Dtype(), + pa.int16(): pd.Int16Dtype(), + pa.int32(): pd.Int32Dtype(), + pa.int64(): pd.Int64Dtype(), + pa.uint8(): pd.UInt8Dtype(), + pa.uint16(): pd.UInt16Dtype(), + pa.uint32(): pd.UInt32Dtype(), + pa.uint64(): pd.UInt64Dtype(), + pa.float32(): pd.Float32Dtype(), + pa.float64(): pd.Float64Dtype(), + pa.bool_(): pd.BooleanDtype(), + pa.string(): pd.StringDtype(), + pa.large_string(): pd.StringDtype(), + }.get + return _NUMPY_NULLABLE_CACHE + + +cdef object _table_signed_dict_indices(object table): + """Recast dictionary columns whose index type is unsigned to int32. + + QuestDB SYMBOL egresses as ``dictionary(uint32, utf8)`` and pandas + rejects unsigned dictionary indices; symbol cardinality fits int32. + Returns the table unchanged when no column needs it. + """ + import pyarrow as pa + schema = table.schema + cdef list fields = [] + cdef bint changed = False + for field in schema: + ty = field.type + if (pa.types.is_dictionary(ty) + and pa.types.is_unsigned_integer(ty.index_type)): + field = field.with_type( + pa.dictionary(pa.int32(), ty.value_type, ty.ordered)) + changed = True + fields.append(field) + if not changed: + return table + return table.cast(pa.schema(fields, metadata=schema.metadata)) + + +cdef dict _KIND_NAMES = { + reader_column_kind_boolean: 'boolean', + reader_column_kind_byte: 'byte', + reader_column_kind_short: 'short', + reader_column_kind_int: 'int', + reader_column_kind_long: 'long', + reader_column_kind_float: 'float', + reader_column_kind_double: 'double', + reader_column_kind_char: 'char', + reader_column_kind_ipv4: 'ipv4', + reader_column_kind_timestamp: 'timestamp', + reader_column_kind_timestamp_nanos: 'timestamp_ns', + reader_column_kind_date: 'date', + reader_column_kind_uuid: 'uuid', + reader_column_kind_long256: 'long256', + reader_column_kind_geohash: 'geohash', + 
reader_column_kind_varchar: 'varchar', + reader_column_kind_binary: 'binary', + reader_column_kind_symbol: 'symbol', + reader_column_kind_double_array: 'double_array', + reader_column_kind_long_array: 'long_array', + reader_column_kind_decimal64: 'decimal', + reader_column_kind_decimal128: 'decimal', + reader_column_kind_decimal256: 'decimal', +} + + +cdef object _UUID_MODULE = None +cdef object _DECIMAL_TYPE = None + + +cdef object _uuid_module(): + global _UUID_MODULE + if _UUID_MODULE is None: + import uuid + _UUID_MODULE = uuid + return _UUID_MODULE + + +cdef object _decimal_type(): + global _DECIMAL_TYPE + if _DECIMAL_TYPE is None: + from decimal import Decimal + _DECIMAL_TYPE = Decimal + return _DECIMAL_TYPE + + +cdef int _reader_check(bint ok, reader_error* err, str what) except -1: + if ok: + return 0 + if err != NULL: + raise _reader_err_to_py(err) + raise IngressError( + IngressErrorCode.ServerFlushError, + what + ' returned false without err_out') + + +cdef object _numpy_dtype_for_kind(reader_column_kind kind, object np): + if kind == reader_column_kind_boolean: + return np.dtype(np.bool_) + if kind == reader_column_kind_byte: + return np.dtype(np.int8) + if kind == reader_column_kind_short: + return np.dtype(np.int16) + if kind == reader_column_kind_int: + return np.dtype(np.int32) + if kind == reader_column_kind_long: + return np.dtype(np.int64) + if kind == reader_column_kind_float: + return np.dtype(np.float32) + if kind == reader_column_kind_double: + return np.dtype(np.float64) + if kind == reader_column_kind_char: + return np.dtype(np.uint16) + if kind == reader_column_kind_ipv4: + return np.dtype(np.uint32) + if kind == reader_column_kind_timestamp: + return np.dtype('datetime64[us]') + if kind == reader_column_kind_timestamp_nanos: + return np.dtype('datetime64[ns]') + if kind == reader_column_kind_date: + return np.dtype('datetime64[ms]') + return None + + +cdef object _numpy_fixed_chunk( + const reader_batch* batch, + size_t col_idx, + 
reader_column_kind kind, + size_t row_count, + object np): + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef object dtype = _numpy_dtype_for_kind(kind, np) + cdef size_t itemsize + cdef Py_ssize_t nbytes + cdef unsigned char* src + if dtype is None: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'numpy egress does not support column kind 0x{:02X} yet'.format( + kind)) + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + itemsize = dtype.itemsize + if cd.value_stride != itemsize: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'column kind 0x{:02X} wire stride {} != numpy itemsize {}'.format( + kind, cd.value_stride, itemsize)) + if row_count == 0: + return np.empty(0, dtype=dtype) + if cd.values == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'column kind 0x{:02X} has {} rows but no values buffer'.format( + kind, row_count)) + nbytes = (row_count * cd.value_stride) + src = cd.values + if kind == reader_column_kind_boolean: + return np.frombuffer((src), dtype=np.uint8) != 0 + return np.frombuffer((src), dtype=dtype).copy() + + +cdef object _numpy_varlen_chunk( + const reader_batch* batch, + size_t col_idx, + reader_column_kind kind, + size_t row_count, + object np): + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef const uint32_t* offsets + cdef const uint8_t* data + cdef const uint8_t* validity + cdef size_t r + cdef uint32_t start + cdef uint32_t end + cdef bint is_binary = kind == reader_column_kind_binary + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + out = np.empty(row_count, dtype=object) + if row_count == 0: + return out + if cd.var_offsets == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'column kind 0x{:02X} has {} rows but no offset table'.format( + kind, row_count)) + offsets = cd.var_offsets + data = cd.var_data + validity = cd.validity 
+ for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> (r & 7)) & 1): + out[r] = None + continue + start = offsets[r] + end = offsets[r + 1] + if end < start or end > cd.var_data_len: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'corrupt varlen offsets in column kind 0x{:02X}'.format( + kind)) + if end > start: + if is_binary: + out[r] = PyBytes_FromStringAndSize( + (data + start), (end - start)) + else: + out[r] = PyUnicode_FromStringAndSize( + (data + start), (end - start)) + else: + out[r] = b'' if is_binary else u'' + return out + + +cdef object _numpy_symbol_codes_chunk( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef const uint32_t* codes + cdef const uint8_t* validity + cdef size_t r + cdef int64_t[::1] mv + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + out = np.empty(row_count, dtype=np.int64) + if row_count == 0: + return out + if cd.symbol_codes == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'symbol column has {} rows but no codes buffer'.format(row_count)) + codes = cd.symbol_codes + validity = cd.validity + mv = out + for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> (r & 7)) & 1): + mv[r] = -1 + else: + mv[r] = codes[r] + return out + + +cdef list _symbol_categories_from_dict(const reader_symbol_dict* sd): + cdef size_t i + cdef const reader_symbol_entry* e + cdef list cats = [] + for i in range(sd.entry_count): + e = &sd.entries[i] + if e.offset + e.length > sd.heap_len: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'corrupt symbol dictionary heap offsets') + cats.append( + PyUnicode_FromStringAndSize( + (sd.heap + e.offset), e.length)) + return cats + + +cdef object _numpy_geohash_chunk( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef reader_column_data 
cd + cdef reader_error* err = NULL + cdef object dtype + cdef size_t stride + cdef size_t target + cdef Py_ssize_t nbytes + cdef unsigned char* src + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + stride = cd.value_stride + if stride == 1: + dtype = np.dtype(np.int8) + target = 1 + elif stride == 2: + dtype = np.dtype(np.int16) + target = 2 + elif stride == 3 or stride == 4: + dtype = np.dtype(np.int32) + target = 4 + elif stride >= 5 and stride <= 8: + dtype = np.dtype(np.int64) + target = 8 + else: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'unexpected geohash byte width {}'.format(stride)) + if row_count == 0: + return np.empty(0, dtype=dtype) + if cd.values == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'geohash column has {} rows but no values buffer'.format(row_count)) + nbytes = (row_count * stride) + src = cd.values + if stride == target: + return np.frombuffer((src), dtype=dtype).copy() + raw = np.frombuffer( + (src), dtype=np.uint8).reshape( + row_count, stride) + wide = np.zeros((row_count, target), dtype=np.uint8) + wide[:, :stride] = raw + return wide.view(dtype).reshape(row_count) + + +cdef object _numpy_uuid_chunk( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef object _uuid = _uuid_module() + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef const uint8_t* validity + cdef const uint8_t* values + cdef size_t r + cdef uint64_t lo + cdef uint64_t hi + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + out = np.empty(row_count, dtype=object) + if row_count == 0: + return out + if cd.values == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'uuid column has {} rows but no values buffer'.format(row_count)) + validity = cd.validity + values = cd.values + for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> 
(r & 7)) & 1): + out[r] = None + continue + memcpy(&lo, values + r * 16, 8) + memcpy(&hi, values + r * 16 + 8, 8) + out[r] = _uuid.UUID(int=((hi) << 64) | (lo)) + return out + + +cdef object _numpy_long256_chunk( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef const uint8_t* validity + cdef const uint8_t* values + cdef size_t r + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + out = np.empty(row_count, dtype=object) + if row_count == 0: + return out + if cd.values == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'long256 column has {} rows but no values buffer'.format(row_count)) + validity = cd.validity + values = cd.values + for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> (r & 7)) & 1): + out[r] = None + continue + out[r] = int.from_bytes( + PyBytes_FromStringAndSize((values + r * 32), 32), + 'little', signed=False) + return out + + +cdef object _numpy_decimal_chunk( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef object Decimal = _decimal_type() + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef const uint8_t* validity + cdef const uint8_t* values + cdef size_t r + cdef size_t width + cdef int scale + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + out = np.empty(row_count, dtype=object) + if row_count == 0: + return out + if cd.values == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'decimal column has {} rows but no values buffer'.format(row_count)) + validity = cd.validity + values = cd.values + width = cd.value_stride + scale = cd.decimal_scale + for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> (r & 7)) & 1): + out[r] = None + continue + unscaled = int.from_bytes( + PyBytes_FromStringAndSize( + 
(values + r * width), width), + 'little', signed=True) + digits = tuple(int(c) for c in str(abs(unscaled))) + out[r] = Decimal((1 if unscaled < 0 else 0, digits, -scale)) + return out + + +cdef object _numpy_array_chunk( + const reader_batch* batch, + size_t col_idx, + reader_column_kind kind, + size_t row_count, + object np): + cdef reader_array_data ad + cdef reader_error* err = NULL + cdef const uint8_t* validity + cdef const uint8_t* data + cdef const uint32_t* data_offsets + cdef const uint32_t* shapes + cdef const uint32_t* shape_offsets + cdef size_t r + cdef size_t k + cdef uint32_t dstart + cdef uint32_t dend + cdef uint32_t sstart + cdef uint32_t send + cdef Py_ssize_t blen + if kind != reader_column_kind_double_array: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'numpy egress supports only double arrays (kind 0x{:02X})'.format( + kind)) + _reader_check( + reader_batch_array_column_data(batch, col_idx, &ad, &err), err, + 'reader_batch_array_column_data') + out = np.empty(row_count, dtype=object) + if row_count == 0: + return out + if ad.data_offsets == NULL or ad.shape_offsets == NULL: + raise IngressError( + IngressErrorCode.ServerFlushError, + 'array column has {} rows but no offset tables'.format(row_count)) + validity = ad.validity + data = ad.data + data_offsets = ad.data_offsets + shapes = ad.shapes + shape_offsets = ad.shape_offsets + for r in range(row_count): + if validity != NULL and ((validity[r >> 3] >> (r & 7)) & 1): + out[r] = None + continue + dstart = data_offsets[r] + dend = data_offsets[r + 1] + blen = (dend - dstart) + if blen > 0: + flat = np.frombuffer( + (((data + dstart))), + dtype=np.float64).copy() + else: + flat = np.empty(0, dtype=np.float64) + sstart = shape_offsets[r] + send = shape_offsets[r + 1] + if send > sstart: + out[r] = flat.reshape( + tuple(shapes[sstart + k] for k in range(send - sstart))) + else: + out[r] = flat + return out + + +cdef object _numpy_column_chunk( + const reader_batch* batch, + size_t 
col_idx, + reader_column_kind kind, + size_t row_count, + object np): + if kind == reader_column_kind_symbol: + return _numpy_symbol_codes_chunk(batch, col_idx, row_count, np) + if (kind == reader_column_kind_varchar + or kind == reader_column_kind_binary): + return _numpy_varlen_chunk(batch, col_idx, kind, row_count, np) + if kind == reader_column_kind_geohash: + return _numpy_geohash_chunk(batch, col_idx, row_count, np) + if kind == reader_column_kind_uuid: + return _numpy_uuid_chunk(batch, col_idx, row_count, np) + if kind == reader_column_kind_long256: + return _numpy_long256_chunk(batch, col_idx, row_count, np) + if (kind == reader_column_kind_decimal64 + or kind == reader_column_kind_decimal128 + or kind == reader_column_kind_decimal256): + return _numpy_decimal_chunk(batch, col_idx, row_count, np) + if (kind == reader_column_kind_double_array + or kind == reader_column_kind_long_array): + return _numpy_array_chunk(batch, col_idx, kind, row_count, np) + return _numpy_fixed_chunk(batch, col_idx, kind, row_count, np) + + +cdef bint _is_hybrid_int(reader_column_kind kind): + return (kind == reader_column_kind_int + or kind == reader_column_kind_long + or kind == reader_column_kind_ipv4 + or kind == reader_column_kind_geohash) + + +cdef object _numpy_validity_mask( + const reader_batch* batch, + size_t col_idx, + size_t row_count, + object np): + cdef reader_column_data cd + cdef reader_error* err = NULL + cdef Py_ssize_t vbytes + cdef unsigned char* vsrc + _reader_check( + reader_batch_column_data(batch, col_idx, &cd, &err), err, + 'reader_batch_column_data') + if row_count == 0 or cd.validity == NULL: + return None + vbytes = ((row_count + 7) // 8) + vsrc = cd.validity + return np.unpackbits( + np.frombuffer((vsrc), dtype=np.uint8), + count=row_count, bitorder='little').astype(bool) + + +cdef object _build_nullable_array( + values, mask, reader_column_kind kind, object pd): + if (kind == reader_column_kind_float + or kind == reader_column_kind_double): + return 
pd.arrays.FloatingArray(values, mask) + if kind == reader_column_kind_boolean: + return pd.arrays.BooleanArray(values, mask) + return pd.arrays.IntegerArray(values, mask) + + +cdef object _combine_hybrid_mask(list value_chunks, list mask_chunks, object np): + cdef size_t n = len(mask_chunks) + cdef size_t i + cdef bint any_null = False + for i in range(n): + if mask_chunks[i] is not None: + any_null = True + break + if not any_null: + return None + parts = [] + for i in range(n): + if mask_chunks[i] is None: + parts.append(np.zeros(len(value_chunks[i]), dtype=bool)) + else: + parts.append(mask_chunks[i]) + if len(parts) == 1: + return parts[0] + return np.concatenate(parts) + + +cdef tuple _numpy_extract_meta(const reader_batch* batch): + cdef size_t n_cols = reader_batch_column_count(batch) + cdef size_t col_idx + cdef reader_column_kind kind = reader_column_kind_unknown + cdef const char* name_buf = NULL + cdef size_t name_len = 0 + cdef reader_error* err = NULL + cdef reader_column_data cd_meta + cdef bint has_symbol = False + col_names = [] + col_kinds = [] + col_scales = [] + col_precision = [] + for col_idx in range(n_cols): + _reader_check( + reader_batch_column_name( + batch, col_idx, &name_buf, &name_len, &err), + err, 'reader_batch_column_name') + col_names.append( + PyUnicode_FromStringAndSize(name_buf, name_len)) + _reader_check( + reader_batch_column_kind(batch, col_idx, &kind, &err), + err, 'reader_batch_column_kind') + col_kinds.append(kind) + col_scales.append(None) + col_precision.append(None) + if kind == reader_column_kind_symbol: + has_symbol = True + elif (kind == reader_column_kind_geohash + or kind == reader_column_kind_decimal64 + or kind == reader_column_kind_decimal128 + or kind == reader_column_kind_decimal256): + if reader_batch_column_data(batch, col_idx, &cd_meta, &err): + if kind == reader_column_kind_geohash: + col_precision[col_idx] = cd_meta.geohash_precision_bits + else: + col_scales[col_idx] = cd_meta.decimal_scale + elif err != 
NULL: + reader_error_free(err) + err = NULL + return (col_names, col_kinds, col_scales, col_precision, has_symbol) + + +cdef object _FROM_CODES_HAS_VALIDATE = None + + +cdef object _symbol_from_codes(object pd, object arr, object dtype): + # SYMBOL codes come straight off the QWP wire — every code is a valid index + # into the dict (or -1 for null) — so skip pandas' O(rows) bounds + # re-validation. `validate=` exists since pandas 2.1; older pandas keeps the + # checked path. `dtype=` reuses one cached category Index across columns and + # batches (vs `categories=`, which rebuilds it every call). + global _FROM_CODES_HAS_VALIDATE + if _FROM_CODES_HAS_VALIDATE is None: + import inspect + _FROM_CODES_HAS_VALIDATE = ( + 'validate' in inspect.signature( + pd.Categorical.from_codes).parameters) + if _FROM_CODES_HAS_VALIDATE: + return pd.Categorical.from_codes(arr, dtype=dtype, validate=False) + return pd.Categorical.from_codes(arr, dtype=dtype) + + +cdef object _numpy_assemble_frame( + list col_names, list col_kinds, list col_scales, + list col_precision, list col_chunks, list symbol_categories, + object np, object pd, list col_masks, object symbol_dtype=None): + cdef size_t n_cols = len(col_names) + cdef size_t col_idx + cdef reader_column_kind kind + arrays = [] + for col_idx in range(n_cols): + kind = col_kinds[col_idx] + chunks = col_chunks[col_idx] + if len(chunks) == 1: + arr = chunks[0] + else: + arr = np.concatenate(chunks) + if kind == reader_column_kind_symbol: + # Build the category Index once (here for fetch-all, or supplied + # pre-built by `_NumpyBatchIter` across batches) and reuse it via + # `dtype=` for every SYMBOL column, then build from the codes with no + # bounds re-validation. Avoids the O(columns/batches x cardinality) + # Index rebuild + validation the `categories=`/validate path costs.
+ if symbol_dtype is None: + symbol_dtype = pd.CategoricalDtype(symbol_categories) + arr = _symbol_from_codes(pd, arr, symbol_dtype) + elif _is_hybrid_int(kind): + mask = _combine_hybrid_mask(chunks, col_masks[col_idx], np) + if mask is not None: + arr = _build_nullable_array(arr, mask, kind, pd) + arrays.append(arr) + frame = pd.DataFrame(dict(enumerate(arrays)), copy=False) + frame.columns = col_names + columns_meta = {} + for col_idx in range(n_cols): + entry = {'kind': _KIND_NAMES.get(col_kinds[col_idx], 'unknown')} + if col_scales[col_idx] is not None: + entry['scale'] = col_scales[col_idx] + if col_precision[col_idx] is not None: + entry['precision_bits'] = col_precision[col_idx] + columns_meta[col_names[col_idx]] = entry + frame.attrs['questdb'] = {'version': 1, 'columns': columns_meta} + return frame + + +cdef tuple _numpy_batch_columns( + const reader_batch* batch, list col_kinds, + size_t n_cols, size_t row_count, object np): + cdef size_t col_idx + cdef reader_column_kind kind + chunks = [] + masks = [] + for col_idx in range(n_cols): + kind = col_kinds[col_idx] + chunks.append(_numpy_column_chunk(batch, col_idx, kind, row_count, np)) + if _is_hybrid_int(kind): + masks.append(_numpy_validity_mask(batch, col_idx, row_count, np)) + else: + masks.append(None) + return (chunks, masks) + + +cdef object _numpy_frame_from_cursor(_CursorHandle handle): + import numpy as np + import pandas as pd + cdef reader_cursor* cursor + cdef reader_error* err = NULL + cdef const reader_batch* batch + cdef reader_symbol_dict sd + cdef size_t n_cols = 0 + cdef size_t row_count = 0 + cdef size_t col_idx + cdef size_t prev_dict_n = 0 + cdef bint first = True + cdef bint has_symbol = False + cdef int seen_seq + + if handle is None or handle._cursor == NULL: + raise IngressError(IngressErrorCode.InvalidApiCall, 'cursor is closed') + cursor = handle._cursor + seen_seq = handle._reset_seq + + col_names = [] + col_kinds = [] + col_scales = [] + col_precision = [] + col_chunks = [] + 
col_masks = [] + symbol_categories = [] + + try: + while True: + with nogil: + batch = reader_cursor_next_batch(cursor, &err) + if handle._reset_seq != seen_seq: + # Mid-query failover replayed from batch-0: discard the + # pre-failover accumulation and re-derive the schema. + seen_seq = handle._reset_seq + first = True + prev_dict_n = 0 + has_symbol = False + col_chunks = [] + col_masks = [] + symbol_categories = [] + if batch == NULL: + if err != NULL: + raise _reader_err_to_py(err) + break + row_count = reader_batch_row_count(batch) + if first: + (col_names, col_kinds, col_scales, col_precision, + has_symbol) = _numpy_extract_meta(batch) + n_cols = len(col_names) + col_chunks = [[] for _ in range(n_cols)] + col_masks = [[] for _ in range(n_cols)] + first = False + if has_symbol: + _reader_check( + reader_batch_symbol_dict(batch, &sd, &err), err, + 'reader_batch_symbol_dict') + if sd.entry_count > prev_dict_n: + symbol_categories = _symbol_categories_from_dict(&sd) + prev_dict_n = sd.entry_count + batch_chunks, batch_masks = _numpy_batch_columns( + batch, col_kinds, n_cols, row_count, np) + for col_idx in range(n_cols): + col_chunks[col_idx].append(batch_chunks[col_idx]) + col_masks[col_idx].append(batch_masks[col_idx]) + except: + handle._free() + raise + + _mark_reader_drained(handle) + handle._free() + + if first: + return pd.DataFrame() + return _numpy_assemble_frame( + col_names, col_kinds, col_scales, col_precision, + col_chunks, symbol_categories, np, pd, col_masks) + + +cdef class _PolarsSymbolRegistry: + """A polars ``Categories`` shared by every SYMBOL column on the same + (append-only) connection dictionary — interned once and grown as the dict + grows, so a QWP code is its own physical categorical code and casts straight + into a ``Categorical`` with no per-row remap (the Rust ``SymbolRegistry`` + analog). 
The interned dictionary is pinned in ``base`` to stop polars' + auto-GC mapping from dropping it between calls.""" + cdef object pl + cdef object cats + cdef object base + cdef object pinned + cdef Py_ssize_t n + + def __cinit__(self, object pl): + self.pl = pl + self.cats = pl.Categories.random('questdb_symbol', physical=pl.UInt32) + self.base = None + self.pinned = None + self.n = 0 + + def accepts(self, object cats_arrow): + # True if this registry's Categories maps `cats_arrow`'s codes + # correctly: the smaller of (pinned, cats_arrow) must be a prefix of the + # larger. The connection dict is append-only, so columns sharing it only + # ever differ by a growth suffix; a column-local dict fails the check and + # gets its own registry. `equals` short-circuits on the shared buffer. + cdef Py_ssize_t m + if self.pinned is None: + return True + m = len(cats_arrow) + if m <= self.n: + return self.pinned.slice(0, m).equals(cats_arrow) + return cats_arrow.slice(0, self.n).equals(self.pinned) + + def column(self, object name, object codes, object cats_arrow): + cdef object pl = self.pl + if cats_arrow is not None and len(cats_arrow) > self.n: + self.base = pl.Series( + pl.from_arrow(cats_arrow), dtype=pl.Categorical(self.cats)) + self.pinned = cats_arrow + self.n = len(cats_arrow) + return codes.cast(pl.UInt32).cast(pl.Categorical(self.cats)).alias(name) + + +cdef tuple _polars_dict_codes_cats(object col, object pl): + # col: a pyarrow ChunkedArray of dictionary type, one chunk per wire batch. + # The dict is append-only and shared across the query — the Rust egress + # attaches the full active connection dict to every batch and only ever grows + # it (see `SymbolValuesCache` in c-questdb-client), and `Table.from_batches` + # keeps the chunks in emission order — so the last chunk's dictionary is the + # largest and covers every (global, stable) code. 
Returns (polars Series of + # the dict indices with nulls preserved via Arrow validity, the full + # dictionary values array). The indices flow straight from Arrow to polars + # (no numpy round-trip, no -1 sentinel). + cdef list chunks = col.chunks + cdef list parts = [] + cdef object ch + if not chunks: + return (pl.Series([], dtype=pl.UInt32), None) + for ch in chunks: + parts.append(pl.from_arrow(ch.indices)) + return (pl.concat(parts) if len(parts) > 1 else parts[0], + chunks[-1].dictionary) + + +cdef object _cast_to_string_view(object col, object svt, object pa, object pc): + # Cast a Utf8 / LargeUtf8 ChunkedArray to Arrow ``string_view``, repairing the + # null trailing variadic data buffer that pyarrow's cast leaves behind for a + # chunk whose every value is inline (<= 12 bytes). The null buffer validates + # fine in-process, but the Arrow C-Data-Interface exporter dereferences it + # unconditionally — so it crashes (SIGSEGV) the moment polars re-exports the + # column across the C-ABI. Swapping the null for a shared 0-length buffer is + # zero-copy and makes the export safe. A natively built string_view already + # uses an empty (non-null) buffer here, so only the cast output needs fixing. + cdef list chunks = [] + cdef object empty = None + cdef object ch, view, bufs + for ch in col.chunks: + view = pc.cast(ch, svt) + bufs = view.buffers() + if bufs and bufs[-1] is None: + if empty is None: + empty = pa.allocate_buffer(0) + view = pa.Array.from_buffers( + svt, len(view), bufs[:-1] + [empty], + null_count=view.null_count, offset=view.offset) + chunks.append(view) + return pa.chunked_array(chunks, type=svt) + + +cdef object _polars_nonsymbol_frame( + object table, list nd_idx, object pl, object pa): + # `pl.from_arrow` for the non-SYMBOL columns. 
Utf8 / LargeUtf8 columns are + # first cast to Arrow `string_view` (when pyarrow exposes it) so polars + # adopts the byte/view buffers zero-copy — its `String` dtype *is* the view + # ("German strings") layout — instead of rebuilding the view from the offset + # layout. Fixed-width columns are already adopted zero-copy. + if not nd_idx: + return None + cdef object tbl = table.select(nd_idx) + cdef object types = tbl.schema.types + cdef object sv = getattr(pa, 'string_view', None) + cdef object svt + cdef Py_ssize_t j + if sv is not None and any( + pa.types.is_string(t) or pa.types.is_large_string(t) for t in types): + import pyarrow.compute as pc + svt = sv() + for j in range(len(types)): + if pa.types.is_string(types[j]) or pa.types.is_large_string(types[j]): + tbl = tbl.set_column( + j, tbl.schema.field(j).with_type(svt), + _cast_to_string_view(tbl.column(j), svt, pa, pc)) + return pl.from_arrow(tbl) + + +cdef object _polars_dataframe_hybrid( + object table, object pl, object pa, dict registries): + # SYMBOL (dictionary) columns are built from codes + dict via a `Categories` + # registry (low CPU, no per-row remap); every other column keeps its exact + # `pl.from_arrow` dtype. One shared registry (key -1) serves every column on + # the connection dict — interned once, not per column — and falls back to a + # per-column registry for a column-local dict. `registries` persists across + # batches so a streaming `iter_polars` stitches via one `Categories`. 
+ cdef list types = table.schema.types + cdef list is_dict = [pa.types.is_dictionary(t) for t in types] + if not any(is_dict): + return pl.from_arrow(table) + cdef list names = table.column_names + cdef list nd_idx = [i for i in range(len(types)) if not is_dict[i]] + nd = _polars_nonsymbol_frame(table, nd_idx, pl, pa) + cdef list cols = [] + cdef Py_ssize_t i + cdef object codes, cats, reg + cdef object shared = registries.get(-1) + for i in range(len(types)): + if is_dict[i]: + codes, cats = _polars_dict_codes_cats(table.column(i), pl) + if shared is None: + shared = _PolarsSymbolRegistry(pl) + registries[-1] = shared + if shared.accepts(cats): + reg = shared + else: + reg = registries.get(i) + if reg is None: + reg = _PolarsSymbolRegistry(pl) + registries[i] = reg + cols.append(reg.column(names[i], codes, cats)) + else: + cols.append(nd.get_column(names[i])) + return pl.DataFrame(cols) + + +cdef class _PolarsBatchIter: + """Streaming `polars.DataFrame` per result batch. Holds the per-symbol + `Categories` registries so every batch's Categoricals share one identity + and `pl.concat` stitches cleanly.""" + cdef object reader + cdef object pl + cdef object pa + cdef dict registries + cdef bint use_hybrid + + def __cinit__(self, _CursorHandle handle, object pl, object pa): + self.reader = _build_record_batch_reader(handle) + self.pl = pl + self.pa = pa + self.registries = {} + self.use_hybrid = getattr(pl, 'Categories', None) is not None + + def __iter__(self): + return self + + def __next__(self): + batch = next(self.reader) + table = self.pa.Table.from_batches([batch]) + if not self.use_hybrid: + return self.pl.from_arrow(table) + try: + return _polars_dataframe_hybrid( + table, self.pl, self.pa, self.registries) + except Exception: + return self.pl.from_arrow(table) + + +cdef class _NumpyBatchIter: + cdef _CursorHandle handle + cdef object np + cdef object pd + cdef list col_names + cdef list col_kinds + cdef list col_scales + cdef list col_precision + cdef bint 
first + cdef bint has_symbol + cdef bint done + cdef size_t prev_dict_n + cdef list symbol_categories + cdef object symbol_dtype + cdef int seen_seq + + def __cinit__(self, _CursorHandle handle): + import numpy as np + import pandas as pd + self.handle = handle + self.np = np + self.pd = pd + self.col_names = [] + self.col_kinds = [] + self.col_scales = [] + self.col_precision = [] + self.first = True + self.has_symbol = False + self.done = False + self.prev_dict_n = 0 + self.symbol_categories = [] + self.symbol_dtype = None + self.seen_seq = handle._reset_seq if handle is not None else 0 + + def __iter__(self): + return self + + def __next__(self): + cdef reader_cursor* cursor + cdef reader_error* err = NULL + cdef const reader_batch* batch + cdef reader_symbol_dict sd + cdef size_t row_count + cdef size_t n_cols + if self.done or self.handle is None or self.handle._cursor == NULL: + raise StopIteration + cursor = self.handle._cursor + with nogil: + batch = reader_cursor_next_batch(cursor, &err) + if self.handle._reset_seq != self.seen_seq: + # Mid-query failover after batches were already yielded: the + # replayed batch-0 would duplicate them. Streaming can't + # discard it, so surface a clean, catchable error. 
+ self.done = True + self.handle._free() + raise IngressError( + IngressErrorCode.FailoverWouldDuplicate, + 'mid-query failover would duplicate already-delivered ' + 'batches; re-issue the query') + if batch == NULL: + if err != NULL: + self.done = True + self.handle._free() + raise _reader_err_to_py(err) + self.done = True + _mark_reader_drained(self.handle) + self.handle._free() + raise StopIteration + try: + row_count = reader_batch_row_count(batch) + if self.first: + (self.col_names, self.col_kinds, self.col_scales, + self.col_precision, self.has_symbol) = \ + _numpy_extract_meta(batch) + self.first = False + n_cols = len(self.col_names) + if self.has_symbol: + _reader_check( + reader_batch_symbol_dict(batch, &sd, &err), err, + 'reader_batch_symbol_dict') + if sd.entry_count > self.prev_dict_n: + self.symbol_categories = _symbol_categories_from_dict(&sd) + # Cache the dtype so each batch's from_codes reuses the + # category Index instead of rebuilding it per batch + # (1056x faster on high-cardinality SYMBOLs). 
+ self.symbol_dtype = self.pd.CategoricalDtype( + self.symbol_categories) + self.prev_dict_n = sd.entry_count + batch_chunks, batch_masks = _numpy_batch_columns( + batch, self.col_kinds, n_cols, row_count, self.np) + col_chunks = [[c] for c in batch_chunks] + col_masks = [[m] for m in batch_masks] + return _numpy_assemble_frame( + self.col_names, self.col_kinds, self.col_scales, + self.col_precision, col_chunks, self.symbol_categories, + self.np, self.pd, col_masks, symbol_dtype=self.symbol_dtype) + except: + self.done = True + self.handle._free() + raise + + def __dealloc__(self): + if not self.done and self.handle is not None: + self.handle._free() + + +cdef object _resolve_arrow_to_pandas_kwargs(dtype_backend, types_mapper): + kwargs = {} + if types_mapper is not None: + kwargs['types_mapper'] = types_mapper + elif dtype_backend == 'pyarrow': + import pandas as pd + kwargs['types_mapper'] = pd.ArrowDtype + elif dtype_backend == 'numpy_nullable': + kwargs['types_mapper'] = _numpy_nullable_mapping() + elif dtype_backend is not None: + raise ValueError( + f'dtype_backend={dtype_backend!r} is invalid, ' + 'only "pyarrow" and "numpy_nullable" are allowed') + return kwargs + + +def _debug_egress_pool_stats(client): + """Return ``(in_use, idle)`` from the client's reader pool. + + The Rust pool doesn't track "opened" / "reused" as counters — they + fall out of ``in_use + idle`` plus the lazy-init pattern (first + borrow opens a connection; the idle list grows on returns; reuse + is implicit). Tests assert reuse by checking that ``idle == 1`` + after sequential queries that each borrowed and returned. Returns + ``None`` if the Client is closed. + + Not part of the public API. + """ + cdef Client c = client + cdef questdb_db* db = c._db + if db == NULL: + return None + # FFI exposes the counts via the Rust QuestDb methods; we surface + # them through the diagnostic-only `dbg_` reader-pool accessors. 
+ return ( + questdb_db_dbg_reader_in_use_count(db), + questdb_db_dbg_reader_free_count(db)) + + +class QueryResult: + """Result of ``Client.query(sql)``. + + Streams query rows as Arrow record batches. **Single-use**: each + materialisation method (``to_pandas``, ``to_arrow``, ``iter_arrow``, + ``iter_pandas``, or the ``__arrow_c_stream__`` PyCapsule protocol) + consumes the underlying cursor; the second consumption raises + ``IngressError``. + + ``__arrow_c_stream__`` is native — the cursor's record batches are + exposed directly through the Arrow C Data Interface, so polars / + duckdb / pandas 3.0 / any Arrow-native consumer can read query + results without pyarrow installed. ``to_arrow`` / ``to_pandas`` / + ``iter_arrow`` / ``iter_pandas`` are convenience wrappers that + do require pyarrow. + + Example:: + + with client.query('SELECT * FROM trades WHERE ts > $1') as result: + df = polars.from_arrow(result) # no pyarrow + # df = result.to_pandas() # pyarrow required + # table = pa.table(result) # pyarrow required + """ + + def __init__(self, _CursorHandle cursor_handle): + self._cursor_handle = cursor_handle + self._consumed = False + + def _take_cursor_handle(self): + if self._consumed: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'QueryResult already consumed') + if self._cursor_handle is None: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'QueryResult cursor was closed') + self._consumed = True + handle = self._cursor_handle + self._cursor_handle = None + return handle + + def __arrow_c_stream__(self, requested_schema=None): + if requested_schema is not None: + raise NotImplementedError( + 'requested_schema is not supported; consume the stream ' + 'and project on the consumer side.') + # Streaming: hand batches out incrementally. A post-delivery + # failover bumps the cursor's _reset_seq; the capsule producer + # surfaces FailoverWouldDuplicate rather than feeding the + # replayed batch-0 as a duplicate (see _qs_pull). 
+ return _make_query_stream_capsule(self._take_cursor_handle()) + + def to_arrow(self): + """Read the full result into a ``pyarrow.Table``. Requires pyarrow. + + Materialise-whole: a mid-query failover replays the result + transparently — the partial accumulation we hold is discarded + from batch-0. The pyarrow-free streaming path + (``__arrow_c_stream__`` consumed by ``polars.from_arrow(result)`` + / ``pa.table(result)``) instead surfaces ``FailoverWouldDuplicate`` + on a post-delivery failover. + """ + import pyarrow as pa + handle = self._take_cursor_handle() + schema, batches = _fetch_all_record_batches(handle, pa) + if schema is None: + return pa.table({}) + return pa.Table.from_batches(batches, schema) + + def to_pandas(self, *, dtype_backend=None, types_mapper=None): + """Read the full result into a ``pandas.DataFrame``. + + The default is a native (no pyarrow), DuckDB-style hybrid built + straight from the QWP column buffers: a nullable integer column + with nulls becomes a pandas nullable ``Int*`` (``pd.NA``); without + nulls it stays plain numpy. ``double``/``float`` stay numpy with + ``NaN``; ``SYMBOL`` → ``Categorical``; ``TIMESTAMP`` → + ``datetime64`` (``NaT``); strings/decimal/uuid/binary → ``object``. + Analysis-safe (aggregations skip ``pd.NA``/``NaN``), and feeds back + into :meth:`Client.dataframe` for a type round-trip — the column + kinds are carried in ``df.attrs['questdb']``. + + ``dtype_backend="pyarrow"`` / ``"numpy_nullable"`` / ``types_mapper`` + select the pyarrow-backed path instead (``pd.ArrowDtype``, pandas + nullable extension dtypes, or a custom mapper) — matching the + ``pd.read_sql`` / ``pd.read_parquet`` convention. 
+ """ + if dtype_backend is not None and types_mapper is not None: + raise ValueError( + 'pass at most one of dtype_backend, types_mapper') + if dtype_backend is None and types_mapper is None: + return self._to_pandas_numpy() + table = _table_signed_dict_indices(self.to_arrow()) + return table.to_pandas( + **_resolve_arrow_to_pandas_kwargs(dtype_backend, types_mapper)) + + def to_polars(self): + """Read the full result into a ``polars.DataFrame``. Requires polars + and pyarrow. + + Non-``SYMBOL`` columns keep their exact ``polars.from_arrow`` dtypes + (tz-aware ``Datetime``, ``Decimal``, ``Binary``, ``List``/``Array``, + …). ``SYMBOL`` columns are built into a polars ``Categorical`` directly + from their codes + dictionary through a persistent ``Categories`` + registry — the wire code is its own physical categorical code, so + there is no per-row ``Dictionary -> Categorical`` remap. Falls back to + ``polars.from_arrow`` when polars' (unstable) ``Categories`` API is + unavailable. + + Materialise-whole: a mid-query failover replays the result + transparently. This accumulates batches in-library (via pyarrow) + so the partial result can be discarded on failover; for the + pyarrow-free streaming path consume ``__arrow_c_stream__`` + directly (``polars.from_arrow(result)``), which surfaces + ``FailoverWouldDuplicate`` on a post-delivery failover. + """ + try: + import polars as pl + except ImportError as ie: + raise ImportError( + '`polars` is required for `to_polars()`. ' + 'Install with `pip install polars`.') from ie + try: + import pyarrow as pa + except ImportError as ie: + raise ImportError( + '`pyarrow` is required for `to_polars()`. 
' + 'Install with `pip install pyarrow`.') from ie + handle = self._take_cursor_handle() + schema, batches = _fetch_all_record_batches(handle, pa) + if schema is None: + return pl.from_arrow(pa.table({})) + table = pa.Table.from_batches(batches, schema) + if getattr(pl, 'Categories', None) is None: + return pl.from_arrow(table) + try: + return _polars_dataframe_hybrid(table, pl, pa, {}) + except Exception: + return pl.from_arrow(table) + + def _to_pandas_numpy(self): + return _numpy_frame_from_cursor(self._take_cursor_handle()) + + def iter_arrow(self): + """Iterate result batches as ``pyarrow.RecordBatch``. + + Streaming: a mid-query failover after the first batch has been + yielded surfaces ``IngressErrorCode.FailoverWouldDuplicate`` (the + already-yielded batches cannot be discarded); re-issue the query. + If the iterator is abandoned partway, cleanup runs at the next + garbage-collection cycle; call :meth:`close` (or use the + context manager) for deterministic release. + """ + reader = _build_record_batch_reader(self._take_cursor_handle()) + for batch in reader: + yield batch + + def iter_pandas(self, *, dtype_backend=None, types_mapper=None): + """Iterate result batches as ``pandas.DataFrame``. + + Mirrors :meth:`to_pandas`: with no arguments each batch is + materialised straight into numpy (no pyarrow, sentinel-preserving, + ``df.attrs['questdb']`` per batch). ``dtype_backend`` / + ``types_mapper`` select the pyarrow-backed path instead.
+ """ + if dtype_backend is not None and types_mapper is not None: + raise ValueError( + 'pass at most one of dtype_backend, types_mapper') + if dtype_backend is None and types_mapper is None: + return _NumpyBatchIter(self._take_cursor_handle()) + return self._iter_pandas_arrow(dtype_backend, types_mapper) + + def _iter_pandas_arrow(self, dtype_backend, types_mapper): + import pyarrow as pa + kwargs = _resolve_arrow_to_pandas_kwargs(dtype_backend, types_mapper) + for batch in self.iter_arrow(): + table = _table_signed_dict_indices(pa.Table.from_batches([batch])) + yield table.to_pandas(**kwargs) + + def iter_polars(self): + """Iterate result batches as ``polars.DataFrame``. + + Mirrors :meth:`to_polars` per batch (same ``Categorical`` SYMBOL + handling) for streaming / low-peak-memory consumption. Every batch's + SYMBOL Categoricals share one persistent ``Categories`` identity, so + ``polars.concat`` over the yielded frames stitches without a + categories-mismatch error. + + Streaming: a mid-query failover after the first batch has been yielded + surfaces ``IngressErrorCode.FailoverWouldDuplicate``; re-issue the + query. Requires polars and pyarrow. + """ + try: + import polars as pl + except ImportError as ie: + raise ImportError( + '`polars` is required for `iter_polars()`. ' + 'Install with `pip install polars`.') from ie + try: + import pyarrow as pa + except ImportError as ie: + raise ImportError( + '`pyarrow` is required for `iter_polars()`. ' + 'Install with `pip install pyarrow`.') from ie + return _PolarsBatchIter(self._take_cursor_handle(), pl, pa) + + def cancel(self): + """Ask the server to stop streaming. Idempotent. + + Distinct from :meth:`close`: ``cancel`` sends a cancellation + frame to QuestDB so the server can drop in-flight work; + ``close`` only releases local resources. A subsequent batch + pull after ``cancel`` typically surfaces + ``IngressErrorCode.Cancelled``. 
+ """ + cdef _CursorHandle handle = self._cursor_handle + cdef reader_error* err = NULL + cdef bint ok + cdef reader_cursor* cursor + if handle is None: + return + with handle._lock: + cursor = handle._cursor + if cursor == NULL: + return + with nogil: + ok = reader_cursor_cancel(cursor, &err) + if not ok: + if err != NULL: + raise _reader_err_to_py(err) + raise IngressError( + IngressErrorCode.ServerFlushError, + 'reader_cursor_cancel returned false ' + 'without setting err_out') + + def close(self): + """Release the cursor + reader. Idempotent. + + Does not send a cancellation frame; use :meth:`cancel` first if + you need the server to stop work. After ``close``, any + previously-returned iterator that hasn't been exhausted will + fail on its next pump with + ``IngressErrorCode.InvalidApiCall``. + """ + cdef _CursorHandle handle = self._cursor_handle + self._cursor_handle = None + self._consumed = True + if handle is not None: + handle._free() + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() + return False diff --git a/src/questdb/extra_cpython.pxd b/src/questdb/extra_cpython.pxd index 3e794566..b42dd152 100644 --- a/src/questdb/extra_cpython.pxd +++ b/src/questdb/extra_cpython.pxd @@ -50,6 +50,8 @@ cdef extern from "Python.h": bint PyUnicode_CheckExact(PyObject* o) + const char* PyUnicode_AsUTF8AndSize(PyObject* o, Py_ssize_t* size) except NULL + bint PyBool_Check(PyObject* o) bint PyLong_CheckExact(PyObject* o) @@ -58,6 +60,8 @@ cdef extern from "Python.h": double PyFloat_AS_DOUBLE(PyObject* o) + double PyFloat_AsDouble(PyObject* o) except? -1.0 + long long PyLong_AsLongLong(PyObject* o) except? 
-1 PyObject* PyErr_Occurred() diff --git a/src/questdb/ingress.pyi b/src/questdb/ingress.pyi index 06278a84..e4dbc639 100644 --- a/src/questdb/ingress.pyi +++ b/src/questdb/ingress.pyi @@ -24,19 +24,30 @@ __all__ = [ "Buffer", + "Client", "IngressError", "IngressErrorCode", + "IngressServerRejectionError", "Protocol", + "QueryResult", "Sender", + "QwpWsError", + "QwpWsErrorCategory", + "QwpWsErrorPolicy", + "QwpWsProgress", + "ServerTimestamp", "ServerTimestampType", "TimestampMicros", "TimestampNanos", "TlsCa", + "UnsupportedDataFrameShapeError", + "WARN_HIGH_RECONNECTS", ] from datetime import datetime, timedelta from enum import Enum -from typing import Any, Dict, List, Optional, Union +from dataclasses import dataclass +from typing import Any, Callable, Dict, Iterator, List, Optional, Union import numpy as np import pandas as pd @@ -55,11 +66,18 @@ class IngressErrorCode(Enum): TlsError = ... HttpNotSupported = ... ServerFlushError = ... + ServerRejection = ... + RoleMismatch = ... ConfigError = ... ArrayError = ... ProtocolVersionError = ... DecimalError = ... + ArrowUnsupportedColumnKind = ... + ArrowIngest = ... + FailoverRetry = ... BadDataFrame = ... + Cancelled = ... + FailoverWouldDuplicate = ... class IngressError(Exception): @@ -69,6 +87,30 @@ class IngressError(Exception): def code(self) -> IngressErrorCode: """Return the error code.""" + @property + def qwp_ws_error(self) -> Optional["QwpWsError"]: + """ + Return the structured QWP/WebSocket HALT diagnostic, if this error + carries one from a terminal QWP/WebSocket sender failure. + """ + + +class IngressServerRejectionError(IngressError): + """ + A terminal QWP/WebSocket server rejection. + + The structured server payload is available through + :attr:`IngressError.qwp_ws_error`. + """ + + +class UnsupportedDataFrameShapeError(IngressError): + """ + A DataFrame shape is not supported by the optimized columnar client path. 
+ """ + + column_failures: tuple + class ServerTimestampType: """ @@ -76,6 +118,11 @@ class ServerTimestampType: """ +ServerTimestamp: ServerTimestampType + +WARN_HIGH_RECONNECTS: bool + + class TimestampMicros: """ A timestamp in microseconds since the UNIX epoch (UTC). @@ -178,7 +225,6 @@ class TimestampNanos: def value(self) -> int: """Number of nanoseconds (Unix epoch timestamp, UTC).""" - class SenderTransaction: """ A transaction for a specific table. @@ -250,85 +296,65 @@ class SenderTransaction: class Buffer: """ - Construct QuestDB InfluxDB Line Protocol (ILP) messages. + Buffer for serializing rows before flushing through a + :func:`Sender `. - The :func:`Buffer.row` method is used to add a row to the buffer. + Use the factory class methods to create a buffer: - You can call this many times. + * :func:`Buffer.ilp` for ILP (InfluxDB Line Protocol) buffers. + * :func:`Buffer.qwp` for QWP (QuestWire Protocol) buffers. .. code-block:: python - from questdb.ingress import Buffer + from questdb.ingress import Buffer, Sender, Protocol, TimestampNanos - buf = Buffer(protocol_version=2) # or better yet, `sender.new_buffer()` + buf = Buffer.ilp(protocol_version=2) buf.row( - 'table_name1', - symbols={'s1', 'v1', 's2', 'v2'}, - columns={'c1': True, 'c2': 0.5}) + 'table_name', + symbols={'s1': 'v1'}, + columns={'c1': True, 'c2': 0.5}, + at=TimestampNanos.now()) - buf.row( - 'table_name2', - symbols={'questdb': '❤️'}, - columns={'like': 100000}) - - # Append any additional rows then, once ready, call - sender.flush(buffer) # a `Sender` instance. - - # The sender auto-cleared the buffer, ready for reuse. - - buf.row( - 'table_name1', - symbols={'s1', 'v1', 's2', 'v2'}, - columns={'c1': True, 'c2': 0.5}) - - # etc. - - In general, it's best to create a new buffer from a sender instance, - via the :func:`Sender.new_buffer` method, as this will ensure the buffer - is configured with the same protocol version and maximum name length - as the sender. 
- - Buffer Constructor Arguments: - * protocol_version (``int``): The protocol version to use. - * ``init_buf_size`` (``int``): Initial capacity of the buffer in bytes. - Defaults to ``65536`` (64KiB). - * ``max_name_len`` (``int``): Maximum length of a column name. - Defaults to ``127`` which is the same default value as QuestDB. - This should match the ``cairo.max.file.name.length`` setting of the - QuestDB instance you're connecting to. - - **Note**: Protocol version ``2`` requires QuestDB server version 9.0.0 or higher. - - .. code-block:: python - - # These two buffer constructions are equivalent. - buf1 = Buffer(protocol_version=2) - buf2 = Buffer(protocol_version=2, init_buf_size=65536, max_name_len=127) - - To avoid having to manually set these arguments every time, you can call - the sender's ``new_buffer()`` method instead. - - .. code-block:: python - - from questdb.ingress import Sender, Buffer + with Sender(Protocol.Http, 'localhost', 9000) as sender: + sender.flush(buf) - sender = Sender('http', 'localhost', 9009, - init_buf_size=16384) - buf = sender.new_buffer() - assert buf.init_buf_size == 16384 - assert buf.max_name_len == 127 + Alternatively, call :func:`Sender.new_buffer` which creates the + correct buffer type (ILP or QWP) matching the sender's protocol. """ def __init__( self, - *, protocol_version: int, init_buf_size: int = 65536, max_name_len: int = 127): """ - Create a new buffer with the an initial capacity and max name length. - :param int protocol_version: The protocol version to use. + .. deprecated:: + Use :func:`Buffer.ilp` or :func:`Buffer.qwp` instead. + """ + ... + + @staticmethod + def ilp( + protocol_version: int = 2, + init_buf_size: int = 65536, + max_name_len: int = 127) -> Buffer: + """ + Create an ILP (InfluxDB Line Protocol) buffer. + + :param int protocol_version: The protocol version to use (1-3). + :param int init_buf_size: Initial capacity of the buffer in bytes. 
+ :param int max_name_len: Maximum length of a table or column name. + """ + ... + + @staticmethod + def qwp( + init_buf_size: int = 65536, + max_name_len: int = 127) -> Buffer: + """ + Create a QWP (QuestWire Protocol) buffer. + :param int init_buf_size: Initial capacity of the buffer in bytes. :param int max_name_len: Maximum length of a table or column name. """ @@ -592,7 +618,7 @@ class Buffer: import pandas as pd import questdb.ingress as qi - buf = qi.Buffer(protocol_version=2) + buf = qi.Buffer.ilp(protocol_version=2) # ... df = pd.DataFrame({ @@ -798,10 +824,56 @@ class Protocol(TaggedEnum): Tcps = ... Http = ... Https = ... + QwpUdp = ... + QwpWs = ... + QwpWss = ... @property def tls_enabled(self) -> bool: ... +class QwpWsProgress(TaggedEnum): + """ + Progress mode for QWP/WebSocket senders. + """ + + Background = ... + Manual = ... + +class QwpWsErrorCategory(TaggedEnum): + """ + Category of a structured QWP/WebSocket diagnostic. + """ + + SchemaMismatch = ... + ParseError = ... + InternalError = ... + SecurityError = ... + WriteError = ... + ProtocolViolation = ... + Unknown = ... + +class QwpWsErrorPolicy(TaggedEnum): + """ + Applied policy for a structured QWP/WebSocket diagnostic. + """ + + DropAndContinue = ... + Halt = ... + +@dataclass(frozen=True) +class QwpWsError: + """ + Structured QWP/WebSocket diagnostic. + """ + + category: QwpWsErrorCategory + applied_policy: QwpWsErrorPolicy + status: Optional[int] + message: str + message_sequence: Optional[int] + from_fsn: int + to_fsn: int + class TlsCa(TaggedEnum): """ Verification mechanism for the server's certificate. @@ -818,6 +890,123 @@ class TlsCa(TaggedEnum): WebpkiAndOsRoots = ... PemFile = ... +class Client: + """ + Pooled QWP/WebSocket client. + """ + + @staticmethod + def from_conf(conf_str: str) -> Client: + """ + Construct a pooled client from a QWP/WebSocket configuration string. + """ + + def __enter__(self) -> Client: ... 
+ + def dataframe( + self, + df: Any, + *, + table_name: Optional[str] = None, + table_name_col: Union[None, int, str] = None, + symbols: Union[str, bool, List[int], List[str]] = "auto", + at: Union[int, str], + max_rows_per_batch: int = 16384, + schema_overrides: Optional[Dict[str, object]] = None, + ) -> Client: + """ + Ingest a dataframe through the pooled columnar QWP path. + + ``df`` accepts any of: + + - **pandas** ``pandas.DataFrame`` (NumPy-backed columns route + through the legacy planner; pyarrow-backed columns route + through the Arrow C Stream capsule path). + - **polars** ``polars.DataFrame`` and ``polars.LazyFrame``. + ``LazyFrame`` is materialised via + ``.collect(engine='streaming')`` (eager ``.collect()`` on + polars < 1.0). + - **pyarrow** ``pa.Table``, ``pa.RecordBatch``, and + ``pa.RecordBatchReader``. + - Any object exposing the Arrow C Data Interface — i.e. with + ``__arrow_c_stream__`` (duckdb / cudf / modin / pyarrow-backed + pandas 2.2+) or ``__arrow_c_array__`` (single Arrow array + exporters, wrapped into a one-batch ``pa.Table``). + """ + + def query(self, sql: str) -> QueryResult: + """ + Execute a SQL query and return a :class:`QueryResult`. + """ + + def reap_idle(self) -> int: + """ + Manually reap idle above-pool-size connections. + """ + + def close(self): + """ + Close the client and its connection pool. + """ + + def __exit__(self, exc_type, _exc_val, _exc_tb): ... + + +class QueryResult: + """ + Result of :meth:`Client.query`. Single-use: each materialisation + method consumes the underlying cursor. + """ + + def __arrow_c_stream__(self, requested_schema: Any = None) -> Any: ... + + def to_arrow(self) -> Any: + """Read the full result into a ``pyarrow.Table``. Requires pyarrow.""" + + def to_pandas( + self, + *, + dtype_backend: Optional[str] = None, + types_mapper: Optional[Callable[[Any], Any]] = None, + ) -> pd.DataFrame: + """Read the full result into a ``pandas.DataFrame``. 
With no arguments + the result is materialised via numpy (pyarrow-free); passing + ``dtype_backend`` or ``types_mapper`` selects the pyarrow path.""" + + def to_polars(self) -> Any: + """Read the full result into a ``polars.DataFrame``. Requires polars + and pyarrow.""" + + def iter_arrow(self) -> Iterator[Any]: + """Iterate result batches as ``pyarrow.RecordBatch``.""" + + def iter_polars(self) -> Iterator[Any]: + """Iterate result batches as ``polars.DataFrame`` (streaming / + low-peak-memory). Batches share one ``Categories`` identity so + ``polars.concat`` over them stitches cleanly. Requires polars and + pyarrow.""" + + def iter_pandas( + self, + *, + dtype_backend: Optional[str] = None, + types_mapper: Optional[Callable[[Any], Any]] = None, + ) -> Iterator[pd.DataFrame]: + """Iterate result batches as ``pandas.DataFrame``. With no arguments + the batches are materialised via numpy (pyarrow-free); passing + ``dtype_backend`` or ``types_mapper`` selects the pyarrow path.""" + + def cancel(self) -> None: + """Ask the server to stop streaming. Idempotent.""" + + def close(self) -> None: + """Release the cursor and reader. Idempotent.""" + + def __enter__(self) -> QueryResult: ... + + def __exit__(self, exc_type, exc_val, exc_tb): ... + + class Sender: """ Ingest data into QuestDB. 
@@ -841,14 +1030,20 @@ class Sender: tls_verify: bool = True, tls_ca: TlsCa = TlsCa.WebpkiRoots, tls_roots=None, + tls_roots_password: Optional[str] = None, max_buf_size: int = 104857600, retry_timeout: int = 10000, + retry_max_backoff: int = 1000, request_min_throughput: int = 102400, request_timeout=None, auto_flush: bool = True, auto_flush_rows: Optional[int] = None, auto_flush_bytes: bool = False, auto_flush_interval: int = 1000, + max_datagram_size: Optional[int] = None, + multicast_ttl: Optional[int] = None, + qwp_ws_progress: Optional[QwpWsProgress] = None, + qwp_ws_error_handler: Optional[Callable[["QwpWsError"], None]] = None, protocol_version=None, init_buf_size: int = 65536, max_name_len: int = 127, @@ -867,14 +1062,20 @@ class Sender: tls_verify: bool = True, tls_ca: TlsCa = TlsCa.WebpkiRoots, tls_roots=None, + tls_roots_password: Optional[str] = None, max_buf_size: int = 104857600, retry_timeout: int = 10000, + retry_max_backoff: int = 1000, request_min_throughput: int = 102400, request_timeout=None, auto_flush: bool = True, auto_flush_rows: Optional[int] = None, auto_flush_bytes: bool = False, auto_flush_interval: int = 1000, + max_datagram_size: Optional[int] = None, + multicast_ttl: Optional[int] = None, + qwp_ws_progress: Optional[QwpWsProgress] = None, + qwp_ws_error_handler: Optional[Callable[["QwpWsError"], None]] = None, protocol_version=None, init_buf_size: int = 65536, max_name_len: int = 127, @@ -903,14 +1104,20 @@ class Sender: tls_verify: bool = True, tls_ca: TlsCa = TlsCa.WebpkiRoots, tls_roots=None, + tls_roots_password: Optional[str] = None, max_buf_size: int = 104857600, retry_timeout: int = 10000, + retry_max_backoff: int = 1000, request_min_throughput: int = 102400, request_timeout=None, auto_flush: bool = True, auto_flush_rows: Optional[int] = None, auto_flush_bytes: bool = False, auto_flush_interval: int = 1000, + max_datagram_size: Optional[int] = None, + multicast_ttl: Optional[int] = None, + qwp_ws_progress: 
Optional[QwpWsProgress] = None, + qwp_ws_error_handler: Optional[Callable[["QwpWsError"], None]] = None, protocol_version=None, init_buf_size: int = 65536, max_name_len: int = 127, @@ -940,10 +1147,6 @@ class Sender: def init_buf_size(self) -> int: """The initial capacity of the sender's internal buffer.""" - @property - def max_name_len(self) -> int: - """Maximum length of a table or column name.""" - @property def auto_flush(self) -> bool: """ @@ -984,7 +1187,7 @@ class Sender: """ @property - def max_name_len(self): + def max_name_len(self) -> int: """ Returns the sender's maximum-configured maximum name length for table names and column names. @@ -1033,7 +1236,7 @@ class Sender: *, symbols: Optional[Dict[str, str]] = None, columns: Optional[ - Dict[str, Union[None, bool, int, float, str, TimestampMicros, datetime, np.ndarray]] + Dict[str, Union[None, bool, int, float, str, TimestampMicros, datetime, np.ndarray, Decimal]] ] = None, at: Union[TimestampNanos, datetime, ServerTimestampType], ) -> Sender: @@ -1118,6 +1321,12 @@ class Sender: :param buffer: The buffer to flush. If ``None``, the internal buffer is flushed. + With QWP/WebSocket, this publishes the buffer into the local sender + queue and returns before the server necessarily ACKs the frame. Later + terminal diagnostics fail subsequent sender calls and are available as + :attr:`IngressError.qwp_ws_error`. Server diagnostics are also + available through :func:`Sender.poll_qwp_ws_error`. + :param clear: If ``True``, the flushed buffer is cleared (default). If ``False``, the flushed buffer is left in the internal buffer. Note that ``clear=False`` is only supported if ``buffer`` is also @@ -1131,6 +1340,57 @@ class Sender: The Python GIL is released during the network IO operation. """ + def flush_and_get_fsn(self, buffer: Optional[Buffer] = None) -> Optional[int]: + """ + Publish a QWP/WebSocket buffer locally, clear it on success, and return + the assigned frame sequence number. 
+ """ + + def flush_and_keep_and_get_fsn( + self, buffer: Optional[Buffer] = None + ) -> Optional[int]: + """ + Publish a QWP/WebSocket buffer locally without clearing it and return + the assigned frame sequence number. + """ + + def published_fsn(self) -> Optional[int]: + """ + Highest QWP/WebSocket frame sequence number published locally. + """ + + def acked_fsn(self) -> Optional[int]: + """ + Highest QWP/WebSocket frame sequence number completed by ACK or + drop-and-continue rejection. + """ + + def await_acked_fsn(self, fsn: int, timeout_millis: int) -> bool: + """ + Wait until the QWP/WebSocket completion watermark reaches ``fsn``. + """ + + def drive_once(self) -> bool: + """ + Drive one QWP/WebSocket progress step for manual progress senders. + """ + + def poll_qwp_ws_error(self) -> Optional[QwpWsError]: + """ + Poll the next structured QWP/WebSocket diagnostic. + """ + + def qwp_ws_errors_dropped(self) -> int: + """ + Number of QWP/WebSocket diagnostics dropped from the bounded ring. + """ + + def close_drain(self): + """ + Stop accepting new QWP/WebSocket publications and wait for already + published frames to resolve. + """ + def close(self, flush: bool = True): """ Disconnect. @@ -1140,6 +1400,8 @@ class Sender: Once a sender is closed, it can't be re-used. :param bool flush: If ``True``, flush the internal buffer before closing. + For QWP/WebSocket, this also drains already-published frames before + closing. """ def __exit__(self, exc_type, _exc_val, _exc_tb): diff --git a/src/questdb/ingress.pyx b/src/questdb/ingress.pyx index 17e43296..ac9a81dd 100644 --- a/src/questdb/ingress.pyx +++ b/src/questdb/ingress.pyx @@ -32,34 +32,50 @@ API for fast data ingestion into QuestDB. 
__all__ = [ 'Buffer', + 'Client', 'IngressError', 'IngressErrorCode', + 'IngressServerRejectionError', 'Protocol', + 'QueryResult', 'Sender', + 'QwpWsError', + 'QwpWsErrorCategory', + 'QwpWsErrorPolicy', + 'QwpWsProgress', 'ServerTimestamp', 'ServerTimestampType', 'TimestampMicros', 'TimestampNanos', 'TlsCa', + 'UnsupportedDataFrameShapeError', 'WARN_HIGH_RECONNECTS' ] # For prototypes: https://github.com/cython/cython/tree/master/Cython/Includes -from libc.stdint cimport uint8_t, uint64_t, int64_t, uint32_t, uintptr_t, \ - INT64_MAX, INT64_MIN +from libc.stdint cimport uint8_t, uint64_t, int64_t, int32_t, uint32_t, \ + uintptr_t, INT64_MAX, INT64_MIN from libc.stdlib cimport malloc, calloc, realloc, free, abort, qsort -from libc.string cimport strncmp, memset +from libc.string cimport strncmp, memset, memcpy, strlen from libc.math cimport isnan from libc.errno cimport errno # from libc.stdio cimport stderr, fprintf from cpython.datetime cimport datetime as cp_datetime from cpython.datetime cimport timedelta as cp_timedelta +from cpython.datetime cimport ( + PyDateTime_GET_YEAR, PyDateTime_GET_MONTH, PyDateTime_GET_DAY, + PyDateTime_DATE_GET_HOUR, PyDateTime_DATE_GET_MINUTE, + PyDateTime_DATE_GET_SECOND, PyDateTime_DATE_GET_MICROSECOND, +) from cpython.bool cimport bool from cpython.weakref cimport PyWeakref_NewRef, PyWeakref_GetObject from cpython.object cimport PyObject from cpython.buffer cimport Py_buffer, PyObject_CheckBuffer, \ PyObject_GetBuffer, PyBuffer_Release, PyBUF_SIMPLE from cpython.memoryview cimport PyMemoryView_FromMemory +from cpython.pycapsule cimport (PyCapsule_GetPointer, PyCapsule_IsValid, + PyCapsule_New) +from cpython.ref cimport Py_INCREF, Py_DECREF from .line_sender cimport * from .rpyutils cimport * @@ -75,12 +91,15 @@ ctypedef int void_int import cython include "dataframe.pxi" +include "egress.pxi" from enum import Enum from typing import List, Tuple, Dict, Union, Any, Optional, Callable, \ Iterable +from dataclasses import dataclass 
import pathlib -from cpython.bytes cimport PyBytes_FromStringAndSize +from cpython.bytes cimport (PyBytes_FromStringAndSize, + PyBytes_GET_SIZE, PyBytes_AsString) import sys import datetime @@ -90,6 +109,7 @@ import collections import time import heapq import warnings +import logging import numpy cimport numpy as cnp @@ -100,6 +120,15 @@ from .extra_numpy cimport * cnp.import_array() +cdef bint _dataframe_columnar_count_io_stats = False +cdef uint64_t _dataframe_columnar_flush_calls = 0 +cdef uint64_t _dataframe_columnar_flush_ns = 0 +cdef uint64_t _dataframe_columnar_sync_calls = 0 +cdef uint64_t _dataframe_columnar_sync_ns = 0 +cdef uint64_t _dataframe_columnar_flush_retry_syncs = 0 + +cdef size_t _QWP_MAX_DEFERRED_ARROW_FRAMES = 100 + # This value is automatically updated by the `bump2version` tool. # If you need to update it, also update the search definition in @@ -139,11 +168,24 @@ class IngressErrorCode(Enum): TlsError = line_sender_error_tls_error HttpNotSupported = line_sender_error_http_not_supported ServerFlushError = line_sender_error_server_flush_error + ServerRejection = line_sender_error_server_rejection + RoleMismatch = line_sender_error_role_mismatch ConfigError = line_sender_error_config_error ArrayError = line_sender_error_array_error ProtocolVersionError = line_sender_error_protocol_version_error DecimalError = line_sender_error_invalid_decimal - BadDataFrame = line_sender_error_invalid_decimal + 1 + ArrowUnsupportedColumnKind = line_sender_error_arrow_unsupported_column_kind + ArrowIngest = line_sender_error_arrow_ingest + FailoverRetry = line_sender_error_failover_retry + # Python-only sentinels with no backing line_sender_error_code. They sit + # in a reserved high band, permanently disjoint from the small contiguous + # FFI code space, so no appended line_sender_error_* variant can ever + # collide with (and silently alias) them. Compared by identity; their + # numeric value is never sent over FFI. 
+ BadDataFrame = 0x10000 + Cancelled = 0x10001 + # Egress-only (reader_error_code 21); not a line_sender_error_code. + FailoverWouldDuplicate = 0x10002 def __str__(self) -> str: """Return the name of the enum.""" @@ -152,15 +194,48 @@ class IngressErrorCode(Enum): class IngressError(Exception): """An error whilst using the ``Sender`` or constructing its ``Buffer``.""" - def __init__(self, code, msg): + def __init__(self, code, msg, qwp_ws_error=None): super().__init__(msg) self._code = code + self._qwp_ws_error = qwp_ws_error @property def code(self) -> IngressErrorCode: """Return the error code.""" return self._code + @property + def qwp_ws_error(self): + """ + Return the structured QWP/WebSocket HALT diagnostic, if this error + carries one from a terminal QWP/WebSocket sender failure. + """ + if self._qwp_ws_error is not None: + self._qwp_ws_error = _qwp_ws_error_from_raw(self._qwp_ws_error) + return self._qwp_ws_error + + +class IngressServerRejectionError(IngressError): + """ + A terminal QWP/WebSocket server rejection. + + The structured server payload is available through + :attr:`IngressError.qwp_ws_error`. + """ + + +class UnsupportedDataFrameShapeError(IngressError): + """ + A DataFrame shape is not supported by the optimized columnar client path. + + The existing ``Sender.dataframe(...)`` row path may still support the + frame. ``column_failures`` carries structured per-column rejection details + where available. 
+ """ + def __init__(self, msg, column_failures=None): + super().__init__(IngressErrorCode.BadDataFrame, msg) + self.column_failures = tuple(column_failures or ()) + cdef inline object c_err_code_to_py(line_sender_error_code code): if code == line_sender_error_could_not_resolve_addr: @@ -183,6 +258,10 @@ cdef inline object c_err_code_to_py(line_sender_error_code code): return IngressErrorCode.HttpNotSupported elif code == line_sender_error_server_flush_error: return IngressErrorCode.ServerFlushError + elif code == line_sender_error_server_rejection: + return IngressErrorCode.ServerRejection + elif code == line_sender_error_role_mismatch: + return IngressErrorCode.RoleMismatch elif code == line_sender_error_config_error: return IngressErrorCode.ConfigError elif code == line_sender_error_array_error: @@ -191,36 +270,75 @@ cdef inline object c_err_code_to_py(line_sender_error_code code): return IngressErrorCode.ProtocolVersionError elif code == line_sender_error_invalid_decimal: return IngressErrorCode.DecimalError + elif code == line_sender_error_arrow_unsupported_column_kind: + return IngressErrorCode.ArrowUnsupportedColumnKind + elif code == line_sender_error_arrow_ingest: + return IngressErrorCode.ArrowIngest + elif code == line_sender_error_failover_retry: + return IngressErrorCode.FailoverRetry else: raise ValueError('Internal error converting error code.') -cdef inline object c_err_to_code_and_msg(line_sender_error* err): +cdef inline object c_qwp_ws_error_view_to_raw( + line_sender_qwpws_error_view view): + cdef object message + if view.message == NULL: + message = '' + else: + message = PyUnicode_FromStringAndSize( + view.message, view.message_len) + return ( + view.category, + view.applied_policy, + view.status if view.has_status else None, + message, + view.message_sequence if view.has_message_sequence else None, + view.from_fsn, + view.to_fsn) + + +cdef inline object c_err_to_fields(line_sender_error* err): """Construct a ``SenderError`` from a C error, 
which will be freed.""" cdef line_sender_error_code code = line_sender_error_get_code(err) cdef size_t c_len = 0 cdef const char* c_msg = line_sender_error_msg(err, &c_len) - cdef object py_err + cdef line_sender_qwpws_error_view qwp_ws_view cdef object py_msg cdef object py_code + cdef object py_qwp_ws_error = None try: py_code = c_err_code_to_py(code) py_msg = PyUnicode_FromStringAndSize(c_msg, c_len) - return (py_code, py_msg) + if line_sender_error_qwpws_get_view(err, &qwp_ws_view): + py_qwp_ws_error = c_qwp_ws_error_view_to_raw(qwp_ws_view) + return (py_code, py_msg, py_qwp_ws_error) finally: line_sender_error_free(err) cdef inline object c_err_to_py(line_sender_error* err): """Construct an ``IngressError`` from a C error, which will be freed.""" - cdef object tup = c_err_to_code_and_msg(err) - return IngressError(tup[0], tup[1]) + cdef object tup = c_err_to_fields(err) + if tup[0] == IngressErrorCode.ServerRejection: + return IngressServerRejectionError(tup[0], tup[1], tup[2]) + return IngressError(tup[0], tup[1], tup[2]) cdef inline object c_err_to_py_fmt(line_sender_error* err, str fmt): """Construct an ``IngressError`` from a C error, which will be freed.""" - cdef object tup = c_err_to_code_and_msg(err) - return IngressError(tup[0], fmt.format(tup[1])) + cdef object tup = c_err_to_fields(err) + if tup[0] == IngressErrorCode.ServerRejection: + return IngressServerRejectionError(tup[0], fmt.format(tup[1]), tup[2]) + return IngressError(tup[0], fmt.format(tup[1]), tup[2]) + + +cdef inline void_int reserve_buffer( + line_sender_buffer* buffer, + size_t additional) except -1: + cdef line_sender_error* err = NULL + if not line_sender_buffer_reserve(buffer, additional, &err): + raise c_err_to_py(err) cdef object _utf8_decode_error( @@ -565,6 +683,7 @@ cdef class TimestampNanos: return f'TimestampNanos({self.value})' +cdef class Client cdef class Sender cdef class Buffer @@ -589,11 +708,21 @@ cdef bint _is_http_protocol(line_sender_protocol protocol): (protocol 
== line_sender_protocol_https)) +cdef bint _is_qwp_udp_protocol(line_sender_protocol protocol): + return protocol == line_sender_protocol_qwpudp + + +cdef bint _is_qwp_ws_protocol(line_sender_protocol protocol): + return ( + (protocol == line_sender_protocol_qwpws) or + (protocol == line_sender_protocol_qwpwss)) + + cdef class SenderTransaction: """ A transaction for a specific table. - Transactions are not supported with ILP/TCP, only ILP/HTTP. + Transactions are only supported with ILP/HTTP. The sender API can only operate on one transaction at a time. @@ -610,11 +739,10 @@ cdef class SenderTransaction: cdef bint _complete def __cinit__(self, Sender sender, str table_name): - if _is_tcp_protocol(sender._c_protocol): + if not _is_http_protocol(sender._c_protocol): raise IngressError( IngressErrorCode.InvalidApiCall, - "Transactions aren't supported for ILP/TCP, " + - "use ILP/HTTP instead.") + 'Transactions are only supported for ILP/HTTP.') self._sender = sender self._table_name = table_name self._complete = False @@ -749,70 +877,37 @@ cdef class SenderTransaction: cdef class Buffer: """ - Construct QuestDB InfluxDB Line Protocol (ILP) messages. - Version 1 is compatible with the InfluxDB Line Protocol. + Buffer for serializing rows before flushing through a + :func:`Sender `. - The :func:`Buffer.row` method is used to add a row to the buffer. + Use the factory class methods to create a buffer: - You can call this many times. + * :func:`Buffer.ilp` for ILP (InfluxDB Line Protocol) buffers. + * :func:`Buffer.qwp` for QWP (QuestWire Protocol) buffers. .. code-block:: python - from questdb.ingress import Buffer - - buf = Buffer() - buf.row( - 'table_name1', - symbols={'s1', 'v1', 's2', 'v2'}, - columns={'c1': True, 'c2': 0.5}) - - buf.row( - 'table_name2', - symbols={'questdb': '❤️'}, - columns={'like': 100000}) - - # Append any additional rows then, once ready, call - sender.flush(buffer) # a `Sender` instance. 
- - # The sender auto-cleared the buffer, ready for reuse. + from questdb.ingress import Buffer, Sender, Protocol, TimestampNanos + buf = Buffer.ilp(protocol_version=2) buf.row( - 'table_name1', - symbols={'s1', 'v1', 's2', 'v2'}, - columns={'c1': True, 'c2': 0.5}) - - # etc. - - - Buffer Constructor Arguments: - * protocol_version (``int``): The protocol version to use. - * ``init_buf_size`` (``int``): Initial capacity of the buffer in bytes. - Defaults to ``65536`` (64KiB). - * ``max_name_len`` (``int``): Maximum length of a column name. - Defaults to ``127`` which is the same default value as QuestDB. - This should match the ``cairo.max.file.name.length`` setting of the - QuestDB instance you're connecting to. - - **Note**: Protocol version ``2`` requires QuestDB server version 9.0.0 or higher. - - .. code-block:: python + 'table_name', + symbols={'s1': 'v1'}, + columns={'c1': True, 'c2': 0.5}, + at=TimestampNanos.now()) - # These two buffer constructions are equivalent. - buf1 = Buffer() - buf2 = Buffer(init_buf_size=65536, max_name_len=127) + with Sender(Protocol.Http, 'localhost', 9000) as sender: + sender.flush(buf) - To avoid having to manually set these arguments every time, you can call - the sender's ``new_buffer()`` method instead. + Alternatively, call :func:`Sender.new_buffer` which creates the + correct buffer type (ILP or QWP) matching the sender's protocol: .. 
code-block:: python - from questdb.ingress import Sender, Buffer + from questdb.ingress import Sender, Protocol - sender = Sender('http', 'localhost', 9009, - init_buf_size=16384, max_name_len=64) - buf = sender.new_buffer() - assert buf.init_buf_size == 16384 - assert buf.max_name_len == 64 + with Sender(Protocol.Http, 'localhost', 9000) as sender: + buf = sender.new_buffer() """ cdef line_sender_buffer* _impl @@ -821,22 +916,83 @@ cdef class Buffer: cdef size_t _max_name_len cdef object _row_complete_sender - def __cinit__(self, protocol_version: int, init_buf_size: int=65536, max_name_len: int=127): + def __cinit__(self): + self._impl = NULL + self._b = NULL + self._init_buf_size = 0 + self._max_name_len = 0 + self._row_complete_sender = None + + def __init__( + self, + protocol_version: int, + init_buf_size: int=65536, + max_name_len: int=127): + """ + .. deprecated:: + Use :func:`Buffer.ilp` or :func:`Buffer.qwp` instead. + """ + warnings.warn( + 'Buffer() is deprecated, use Buffer.ilp() or Buffer.qwp() instead.', + DeprecationWarning, + stacklevel=2) + if protocol_version not in range(1, 4): + raise IngressError( + IngressErrorCode.ProtocolVersionError, + 'Invalid protocol version. Supported versions are 1-3.') + self._init_ilp_impl(protocol_version, init_buf_size, max_name_len) + + @staticmethod + def ilp( + protocol_version: int=2, + init_buf_size: int=65536, + max_name_len: int=127): """ - Create a new buffer with the an initial capacity and max name length. + Create an ILP (InfluxDB Line Protocol) buffer. + + :param int protocol_version: The protocol version to use (1-3). + Defaults to ``2``. :param int init_buf_size: Initial capacity of the buffer in bytes. + Defaults to ``65536`` (64KiB). :param int max_name_len: Maximum length of a table or column name. + Defaults to ``127``. """ if protocol_version not in range(1, 4): raise IngressError( IngressErrorCode.ProtocolVersionError, 'Invalid protocol version. 
Supported versions are 1-3.') - self._cinit_impl(protocol_version, init_buf_size, max_name_len) + cdef Buffer buf = Buffer.__new__(Buffer) + buf._init_ilp_impl(protocol_version, init_buf_size, max_name_len) + return buf + + @staticmethod + def qwp( + init_buf_size: int=65536, + max_name_len: int=127): + """ + Create a QWP (QuestWire Protocol) buffer. + + :param int init_buf_size: Initial capacity of the buffer in bytes. + Defaults to ``65536`` (64KiB). + :param int max_name_len: Maximum length of a table or column name. + Defaults to ``127``. + """ + cdef Buffer buf = Buffer.__new__(Buffer) + buf._init_qwp_impl(init_buf_size, max_name_len) + return buf - cdef inline _cinit_impl(self, line_sender_protocol_version version, size_t init_buf_size, size_t max_name_len): + cdef inline _init_ilp_impl(self, line_sender_protocol_version version, size_t init_buf_size, size_t max_name_len): self._impl = line_sender_buffer_with_max_name_len(version, max_name_len) self._b = qdb_pystr_buf_new() - line_sender_buffer_reserve(self._impl, init_buf_size) + reserve_buffer(self._impl, init_buf_size) + self._init_buf_size = init_buf_size + self._max_name_len = max_name_len + self._row_complete_sender = None + + cdef inline _init_qwp_impl(self, size_t init_buf_size, size_t max_name_len): + self._impl = line_sender_buffer_new_qwp_with_max_name_len(max_name_len) + self._b = qdb_pystr_buf_new() + reserve_buffer(self._impl, init_buf_size) self._init_buf_size = init_buf_size self._max_name_len = max_name_len self._row_complete_sender = None @@ -847,6 +1003,12 @@ cdef class Buffer: qdb_pystr_buf_free(self._b) line_sender_buffer_free(self._impl) + cdef inline void_int _check_impl(self) except -1: + if self._impl == NULL: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'Buffer is not initialized.') + @property def init_buf_size(self) -> int: """ @@ -869,10 +1031,12 @@ cdef class Buffer: """ if additional < 0: raise ValueError('additional must be non-negative.') - 
line_sender_buffer_reserve(self._impl, additional) + self._check_impl() + reserve_buffer(self._impl, additional) def capacity(self) -> int: """The current buffer capacity.""" + self._check_impl() return line_sender_buffer_capacity(self._impl) def clear(self): @@ -885,6 +1049,7 @@ cdef class Buffer: This method is designed to be called only in conjunction with ``sender.flush(buffer, clear=False)``. """ + self._check_impl() line_sender_buffer_clear(self._impl) qdb_pystr_buf_clear(self._b) @@ -894,6 +1059,7 @@ cdef class Buffer: Equivalent (but cheaper) to ``len(bytes(buffer))``. """ + self._check_impl() return line_sender_buffer_size(self._impl) def __bytes__(self) -> bytes: @@ -901,6 +1067,7 @@ cdef class Buffer: return self._to_bytes() cdef inline object _to_bytes(self): + self._check_impl() cdef line_sender_buffer_view view = line_sender_buffer_peek(self._impl) return PyBytes_FromStringAndSize( view.buf, view.len) @@ -1110,6 +1277,7 @@ cdef class Buffer: Add a row to the buffer. """ cdef bint wrote_fields = False + self._check_impl() self._set_marker() try: self._table(table_name) @@ -1267,8 +1435,10 @@ cdef class Buffer: not using the buffer explicitly. It supports the same parameters and also supports auto-flushing. - This feature requires the ``pandas``, ``numpy`` and ``pyarrow`` - package to be installed. + Requires ``pandas`` and ``numpy``. ``pyarrow`` is only needed + when the frame contains ``pd.ArrowDtype`` / ``pd.Categorical`` / + ``string`` dtype columns — purely NumPy / object dtypes work + without it. Adding a dataframe can trigger auto-flushing behaviour, even between rows of the same dataframe. To avoid this, you can @@ -1359,7 +1529,7 @@ cdef class Buffer: import pandas as pd import questdb.ingress as qi - buf = qi.Buffer(protocol_version=2) + buf = qi.Buffer.ilp(protocol_version=2) # ... 
df = pd.DataFrame({ @@ -1533,6 +1703,7 @@ cdef class Buffer: IngressErrorCode.InvalidTimestamp, "`at` must be of type TimestampNanos, datetime, or ServerTimestamp" ) + self._check_impl() _dataframe( auto_flush_blank(), self._impl, @@ -1563,6 +1734,10 @@ cdef uint64_t _timedelta_to_millis(cp_timedelta timedelta): return millis +cdef bint _is_int_not_bool(object value): + return isinstance(value, int) and not isinstance(value, bool) + + cdef int64_t auto_flush_rows_default(line_sender_protocol protocol): if _is_http_protocol(protocol): return 75000 @@ -1576,14 +1751,18 @@ cdef void_int _parse_auto_flush( object auto_flush_rows, object auto_flush_bytes, object auto_flush_interval, - auto_flush_mode_t* c_auto_flush + auto_flush_mode_t* c_auto_flush, + size_t max_datagram_size ) except -1: # Set defaults. if auto_flush_rows is None: auto_flush_rows = auto_flush_rows_default(protocol) if auto_flush_bytes is None: - auto_flush_bytes = False + if _is_qwp_udp_protocol(protocol): + auto_flush_bytes = max_datagram_size if max_datagram_size else 1400 + else: + auto_flush_bytes = False if auto_flush_interval is None: auto_flush_interval = 1000 @@ -1729,12 +1908,13 @@ class TaggedEnum(Enum): """ if tag is None: return None + elif isinstance(tag, cls): + return tag elif isinstance(tag, str): for entry in cls: if entry.tag == tag: return entry - elif isinstance(tag, cls): - return tag + raise ValueError(f'Invalid value for {cls.__name__}: {tag!r}') else: raise ValueError(f'Invalid value for {cls.__name__}: {tag!r}') @@ -1749,10 +1929,124 @@ class Protocol(TaggedEnum): Tcps = ('tcps', 1) Http = ('http', 2) Https = ('https', 3) + QwpUdp = ('qwpudp', 4) + QwpWs = ('qwpws', 5) + QwpWss = ('qwpwss', 6) @property def tls_enabled(self): - return self in (Protocol.Tcps, Protocol.Https) + return self in (Protocol.Tcps, Protocol.Https, Protocol.QwpWss) + + +class QwpWsProgress(TaggedEnum): + """ + Progress mode for QWP/WebSocket senders. 
+ """ + Background = ('background', LINE_SENDER_QWPWS_PROGRESS_BACKGROUND) + Manual = ('manual', LINE_SENDER_QWPWS_PROGRESS_MANUAL) + + +class QwpWsErrorCategory(TaggedEnum): + """ + Category of a structured QWP/WebSocket diagnostic. + """ + SchemaMismatch = ( + 'schema_mismatch', + LINE_SENDER_QWPWS_ERROR_SCHEMA_MISMATCH) + ParseError = ('parse_error', LINE_SENDER_QWPWS_ERROR_PARSE_ERROR) + InternalError = ('internal_error', LINE_SENDER_QWPWS_ERROR_INTERNAL_ERROR) + SecurityError = ('security_error', LINE_SENDER_QWPWS_ERROR_SECURITY_ERROR) + WriteError = ('write_error', LINE_SENDER_QWPWS_ERROR_WRITE_ERROR) + ProtocolViolation = ( + 'protocol_violation', + LINE_SENDER_QWPWS_ERROR_PROTOCOL_VIOLATION) + Unknown = ('unknown', LINE_SENDER_QWPWS_ERROR_UNKNOWN) + + +class QwpWsErrorPolicy(TaggedEnum): + """ + Applied policy for a structured QWP/WebSocket diagnostic. + """ + DropAndContinue = ( + 'drop_and_continue', + LINE_SENDER_QWPWS_ERROR_DROP_AND_CONTINUE) + Halt = ('halt', LINE_SENDER_QWPWS_ERROR_HALT) + + +@dataclass(frozen=True) +class QwpWsError: + category: QwpWsErrorCategory + applied_policy: QwpWsErrorPolicy + status: Optional[int] + message: str + message_sequence: Optional[int] + from_fsn: int + to_fsn: int + + +def _qwp_ws_error_from_raw(raw): + if raw is None or isinstance(raw, QwpWsError): + return raw + + ( + category, + applied_policy, + status, + message, + message_sequence, + from_fsn, + to_fsn, + ) = raw + + py_category = QwpWsErrorCategory.Unknown + for entry in QwpWsErrorCategory: + if entry.c_value == category: + py_category = entry + break + + py_policy = QwpWsErrorPolicy.Halt + for entry in QwpWsErrorPolicy: + if entry.c_value == applied_policy: + py_policy = entry + break + + return QwpWsError( + py_category, + py_policy, + status, + message, + message_sequence, + from_fsn, + to_fsn) + + +def _default_qwp_ws_error_handler(error): + level = ( + logging.ERROR + if error.applied_policy is QwpWsErrorPolicy.Halt + else logging.WARNING) + 
logging.getLogger("questdb.ingress").log( + level, + "QWP/WebSocket server rejection: " + "category=%s policy=%s status=%s fsn=[%s,%s] seq=%s message=%s", + error.category.tag, + error.applied_policy.tag, + error.status, + error.from_fsn, + error.to_fsn, + error.message_sequence, + error.message) + + +cdef void _qwp_ws_error_trampoline( + void* user_data, + const line_sender_qwpws_error_view* view) noexcept with gil: + cdef object handler = <object>user_data + try: + handler(_qwp_ws_error_from_raw(c_qwp_ws_error_view_to_raw(view[0]))) + except BaseException: + logging.getLogger("questdb.ingress").exception( + "QWP/WebSocket error handler failed") class TlsCa(TaggedEnum): @@ -1805,17 +2099,20 @@ cdef object parse_conf_str( if c_conf_str == NULL: raise c_parse_conf_err_to_py(err) - c_buf1 = questdb_conf_str_service(c_conf_str, &c_len1) - service = PyUnicode_FromStringAndSize(c_buf1, c_len1) - - c_iter = questdb_conf_str_iter_pairs(c_conf_str) - while questdb_conf_str_iter_next(c_iter, &c_buf1, &c_len1, &c_buf2, &c_len2): - key = PyUnicode_FromStringAndSize(c_buf1, c_len1) - value = PyUnicode_FromStringAndSize(c_buf2, c_len2) - params[key] = value - - questdb_conf_str_iter_free(c_iter) - questdb_conf_str_free(c_conf_str) + c_iter = NULL + try: + c_buf1 = questdb_conf_str_service(c_conf_str, &c_len1) + service = PyUnicode_FromStringAndSize(c_buf1, c_len1) + + c_iter = questdb_conf_str_iter_pairs(c_conf_str) + while questdb_conf_str_iter_next(c_iter, &c_buf1, &c_len1, &c_buf2, &c_len2): + key = PyUnicode_FromStringAndSize(c_buf1, c_len1) + value = PyUnicode_FromStringAndSize(c_buf2, c_len2) + params[key] = value + finally: + if c_iter != NULL: + questdb_conf_str_iter_free(c_iter) + questdb_conf_str_free(c_conf_str) # We now need to parse the various values in the dict from their # string values to their Python types, as expected by the overrides @@ -1824,6 +2121,8 @@ cdef object parse_conf_str( # are kept as strings and are parsed by Sender._set_sender_fields. 
type_mappings = { 'bind_interface': str, + 'max_datagram_size': int, + 'multicast_ttl': int, 'username': str, 'password': str, 'token': str, @@ -1833,8 +2132,10 @@ cdef object parse_conf_str( 'tls_verify': str, 'tls_ca': str, 'tls_roots': str, + 'tls_roots_password': str, 'max_buf_size': int, 'retry_timeout': int, + 'retry_max_backoff_millis': int, 'request_min_throughput': int, 'request_timeout': int, 'auto_flush': str, @@ -1843,6 +2144,7 @@ cdef object parse_conf_str( 'auto_flush_interval': str, 'init_buf_size': int, 'max_name_len': int, + 'qwp_ws_progress': str, } params = { k: type_mappings.get(k, str)(v) @@ -1851,165 +2153,3607 @@ cdef object parse_conf_str( return (Protocol.parse(service), params) -cdef class Sender: - """ - Ingest data into QuestDB. +cdef str conf_str_value(object value): + return str(value).replace(';', ';;') - See the :ref:`sender` documentation for more information. - """ - # We need the Buffer held by a Sender can hold a weakref to its Sender. - # This avoids a circular reference that requires the GC to clean up. 
- cdef object __weakref__ +cdef bint _dataframe_columnar_has_single_contiguous_chunk( + col_t* col, + size_t row_count) noexcept nogil: + cdef ArrowArray* arr + if col.setup.chunks.n_chunks != 1: + return False + if col.setup.chunks.chunks == NULL: + return False + arr = &col.setup.chunks.chunks[0] + return ( + arr.offset == 0 and + arr.length == row_count and + arr.buffers != NULL and + arr.buffers[1] != NULL) + + +cdef bint _dataframe_columnar_i64_has_nat( + const int64_t* data, + size_t row_count) noexcept nogil: + cdef size_t row_index + for row_index in range(row_count): + if data[row_index] == _NAT: + return True + return False - cdef line_sender_protocol _c_protocol - cdef line_sender_opts* _opts - cdef line_sender* _impl - cdef Buffer _buffer - cdef auto_flush_mode_t _auto_flush_mode - cdef int64_t* _last_flush_ms - cdef size_t _init_buf_size - cdef bint _in_txn - cdef int64_t _slot_id - cdef void_int _set_sender_fields( - self, - qdb_pystr_buf* b, - object protocol, - str bind_interface, - str username, - str password, - str token, - str token_x, - str token_y, - object auth_timeout, - object tls_verify, - object tls_ca, - object tls_roots, - object max_buf_size, - object retry_timeout, - object request_min_throughput, - object request_timeout, - object auto_flush, - object auto_flush_rows, - object auto_flush_bytes, - object auto_flush_interval, - object protocol_version, - object init_buf_size, - object max_name_len) except -1: - """ - Set optional parameters for the sender. 
- """ - cdef line_sender_error* err = NULL - cdef str user_agent = 'questdb/python/' + VERSION - cdef line_sender_utf8 c_user_agent - cdef line_sender_utf8 c_bind_interface - cdef line_sender_utf8 c_username - cdef line_sender_utf8 c_password - cdef line_sender_utf8 c_token - cdef line_sender_utf8 c_token_x - cdef line_sender_utf8 c_token_y - cdef uint64_t c_auth_timeout - cdef bint c_tls_verify - cdef line_sender_ca c_tls_ca - cdef line_sender_utf8 c_tls_roots - cdef uint64_t c_max_buf_size - cdef uint64_t c_retry_timeout - cdef uint64_t c_request_min_throughput - cdef uint64_t c_request_timeout +cdef bint _dataframe_columnar_i64_has_negative( + const int64_t* data, + size_t row_count) noexcept nogil: + cdef size_t row_index + for row_index in range(row_count): + if data[row_index] < 0: + return True + return False - self._c_protocol = protocol.c_value - # It's OK to override this setting. - str_to_utf8(b, user_agent, &c_user_agent) - if not line_sender_opts_user_agent(self._opts, c_user_agent, &err): - raise c_err_to_py(err) +cdef int _dataframe_columnar_ts_field_scan( + ArrowArray* arr, + const int64_t* data, + size_t row_count) noexcept nogil: + # 0: ok, 1: NaT in a non-null row, 2: pre-epoch value in a non-null row. + # Null rows (cleared validity bit) carry an undefined physical value and + # are skipped; the column is sent with its validity bitmap. + cdef size_t row_index + cdef const uint8_t* validity = NULL + if arr.null_count != 0: + validity = <const uint8_t*>arr.buffers[0] + for row_index in range(row_count): + if validity != NULL and not ( + validity[row_index >> 3] & (1 << (row_index & 7))): + continue + if data[row_index] == _NAT: + return 1 + if data[row_index] < 0: + return 2 + return 0 + + +cdef const column_sender_validity* _dataframe_columnar_validity( + ArrowArray* arr, + size_t row_offset, + size_t row_count, + column_sender_validity* validity) except? 
NULL: + if arr.null_count == 0: + return NULL + if row_offset % 8 != 0: + raise RuntimeError( + 'Columnar validity slices must start at byte-aligned row offsets.') + validity.bits = (<const uint8_t*>arr.buffers[0]) + (row_offset // 8) + validity.bit_len = row_count + return validity + + +cdef bint _dataframe_columnar_has_validity( + ArrowArray* arr) noexcept nogil: + return arr.null_count == 0 or arr.buffers[0] != NULL + + +cdef bint _dataframe_columnar_has_utf8_values( + ArrowArray* arr, bint large_offsets) noexcept nogil: + if not (arr.n_buffers >= 3 and + arr.buffers != NULL and + arr.buffers[1] != NULL): + return False + if arr.length == 0 or arr.buffers[2] != NULL: + return True + # NULL byte buffer is valid only with zero data bytes (all-null/empty). + if large_offsets: + return (<const int64_t*>arr.buffers[1])[arr.offset + arr.length] == 0 + return (<const int32_t*>arr.buffers[1])[arr.offset + arr.length] == 0 - if bind_interface is not None: - str_to_utf8(b, bind_interface, &c_bind_interface) - if not line_sender_opts_bind_interface( - self._opts, c_bind_interface, &err): - raise c_err_to_py(err) - if username is not None: - str_to_utf8(b, username, &c_username) - if not line_sender_opts_username(self._opts, c_username, &err): - raise c_err_to_py(err) +cdef bint _dataframe_columnar_has_utf8_dictionary( + ArrowArray* arr) noexcept nogil: + cdef ArrowArray* dictionary = arr.dictionary + if dictionary == NULL: + return False + return ( + dictionary.offset == 0 and + dictionary.n_buffers >= 3 and + dictionary.buffers != NULL and + dictionary.buffers[1] != NULL and + (dictionary.length == 0 or dictionary.buffers[2] != NULL)) - if password is not None: - str_to_utf8(b, password, &c_password) - if not line_sender_opts_password(self._opts, c_password, &err): - raise c_err_to_py(err) - if token is not None: - str_to_utf8(b, token, &c_token) - if not line_sender_opts_token(self._opts, c_token, &err): - raise c_err_to_py(err) +cdef bint _dataframe_columnar_plan_has_validity( + dataframe_plan_t* plan) noexcept 
nogil: + """ + True when chunk row boundaries must be byte-aligned (multiples of 8). + Triggers for: + - Any Arrow column whose `null_count != 0` (the encoder reads a + validity bitmap and our slicing requires byte-alignment). + - Any PyObject source. The planner can't see the nulls until the + build phase walks the column; we conservatively assume they + might be present and require alignment. + - col_source_bool_pyobj specifically packs its VALUES into an + LSB-first bitmap; the emit shift `row_offset // 8` requires + alignment whether or not nulls are present. + """ + cdef size_t col_index + cdef ArrowArray* arr + cdef col_t* col + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + if _is_pyobj_source(col.setup.source): + return True + arr = &col.setup.chunks.chunks[0] + if arr.null_count != 0: + return True + return False - if token_x is not None: - str_to_utf8(b, token_x, &c_token_x) - if not line_sender_opts_token_x(self._opts, c_token_x, &err): - raise c_err_to_py(err) - if token_y is not None: - str_to_utf8(b, token_y, &c_token_y) - if not line_sender_opts_token_y(self._opts, c_token_y, &err): - raise c_err_to_py(err) +cdef size_t _dataframe_columnar_rows_per_chunk( + dataframe_plan_t* plan, + size_t max_rows_per_chunk) noexcept nogil: + # Clamp to a hard safety upper bound and align to 8 rows when the plan + # carries a validity bitmap (chunk boundary must be byte-aligned). 
+ cdef size_t rows_per_chunk = max_rows_per_chunk + if rows_per_chunk > 1000000: + rows_per_chunk = 1000000 + if rows_per_chunk == 0: + rows_per_chunk = 1 + if _dataframe_columnar_plan_has_validity(plan): + if rows_per_chunk < 8 and rows_per_chunk < plan.row_count: + rows_per_chunk = 8 + elif rows_per_chunk > 8: + rows_per_chunk -= rows_per_chunk % 8 + if rows_per_chunk == 0: + rows_per_chunk = 8 + return rows_per_chunk + + +cdef object _dataframe_columnar_global_failure(str reason): + return { + 'column': None, + 'target': None, + 'source_code': None, + 'reason': reason, + } - if protocol_version is not None: - if protocol_version == 'auto': + +cdef object _dataframe_columnar_col_failure( + object df, + col_t* col, + str reason): + return { + 'column': df.columns[col.setup.orig_index], + 'target': _TARGET_NAMES[col.setup.target], + 'source_code': col.setup.source, + 'reason': reason, + } + + +cdef object _dataframe_columnar_plan_normalizations( + object df, + dataframe_plan_t* plan): + cdef list normalizations = [] + cdef size_t col_index + cdef col_t* col + + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + if col.setup.large_string_cast_to_utf8: + # Cast is performed for the row-path planner-shared with + # this columnar path; the columnar emitter would handle + # `U` natively, but the planner produced `u` by the time + # it reaches us. Reported for symmetry with the support + # report's existing schema. 
+ normalizations.append({ + 'column': df.columns[col.setup.orig_index], + 'target': _TARGET_NAMES[col.setup.target], + 'source_code': col.setup.source, + 'action': 'arrow_large_string_cast_to_utf8', + 'copy_expected': True, + }) + return normalizations + + +cdef object _dataframe_columnar_plan_failures( + object df, + dataframe_plan_t* plan): + cdef list failures = [] + cdef size_t col_index + cdef size_t field_count = 0 + cdef col_t* col + cdef const int64_t* ts_data + cdef int ts_scan + + if (plan.col_count == 0) or (plan.row_count == 0): + return failures + + if plan.c_table_name.buf == NULL: + failures.append(_dataframe_columnar_global_failure( + 'v1 requires a fixed table_name; table_name_col is not supported.')) + + if plan.at_value != _AT_IS_SET_BY_COLUMN: + failures.append(_dataframe_columnar_global_failure( + 'v1 requires at to be a non-null DataFrame timestamp column.')) + + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + if col.setup.target == col_target_t.col_target_skip: + continue + if col.setup.target == col_target_t.col_target_table: + failures.append(_dataframe_columnar_col_failure( + df, col, 'table-name columns are not supported in v1.')) + continue + if col.setup.target != col_target_t.col_target_at: + field_count += 1 + if not _dataframe_columnar_has_single_contiguous_chunk( + col, plan.row_count): + failures.append(_dataframe_columnar_col_failure( + df, col, 'v1 requires one contiguous zero-offset buffer.')) + continue + if not _dataframe_columnar_has_validity( + &col.setup.chunks.chunks[0]): + failures.append(_dataframe_columnar_col_failure( + df, col, 'v1 requires a zero-offset validity bitmap when ' + 'nulls are present.')) + continue + + if col.setup.target == col_target_t.col_target_column_bool: + if col.setup.source not in ( + col_source_t.col_source_bool_pyobj, + col_source_t.col_source_bool_numpy): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports object-dtype bool or NumPy 
bool ' + 'columns; Arrow nullable bool not yet supported.')) + elif col.setup.target == col_target_t.col_target_column_i64: + if col.setup.source not in ( + col_source_t.col_source_i64_numpy, + col_source_t.col_source_i8_numpy, + col_source_t.col_source_i16_numpy, + col_source_t.col_source_i32_numpy, + col_source_t.col_source_u8_numpy, + col_source_t.col_source_u16_numpy, + col_source_t.col_source_u32_numpy, + col_source_t.col_source_u64_numpy, + col_source_t.col_source_u32_arrow, + col_source_t.col_source_i64_arrow, + col_source_t.col_source_int_pyobj): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports NumPy signed/unsigned int columns, ' + 'Arrow uint32/int64 columns, or object-dtype int ' + 'columns.')) + elif col.setup.target == col_target_t.col_target_column_f64: + if col.setup.source not in ( + col_source_t.col_source_f64_numpy, + col_source_t.col_source_f32_numpy, + col_source_t.col_source_f64_arrow, + col_source_t.col_source_float_pyobj): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports NumPy float32/float64, Arrow ' + 'float64, or object-dtype float columns.')) + elif col.setup.target == col_target_t.col_target_column_ts: + if col.setup.source not in ( + col_source_t.col_source_dt64ns_numpy, + col_source_t.col_source_dt64us_numpy, + col_source_t.col_source_dt64ns_tz_arrow, + col_source_t.col_source_dt64us_tz_arrow): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports NumPy datetime64[ns/us] or ' + 'tz-aware datetime64/timestamp[pyarrow] ' + 'timestamp field columns.')) + else: + ts_data = col.setup.chunks.chunks[0].buffers[1] + ts_scan = _dataframe_columnar_ts_field_scan( + &col.setup.chunks.chunks[0], ts_data, plan.row_count) + if ts_scan == 1: + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 timestamp field columns cannot contain NaT.')) + elif ts_scan == 2: + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 
timestamp field columns cannot contain ' + 'timestamps before the Unix epoch.')) + elif col.setup.target == col_target_t.col_target_column_str: + if col.setup.source == col_source_t.col_source_str_pyobj: + # PyObject sources are validated by the pre-build phase + # at row level (one walk catches all rows). The planner + # has nothing more to check here. pass - elif (protocol_version == 1) or (protocol_version == '1'): - if not line_sender_opts_protocol_version( - self._opts, line_sender_protocol_version_1, &err): - raise c_err_to_py(err) - elif (protocol_version == 2) or (protocol_version == '2'): - if not line_sender_opts_protocol_version( - self._opts, line_sender_protocol_version_2, &err): - raise c_err_to_py(err) - elif (protocol_version == 3) or (protocol_version == '3'): - if not line_sender_opts_protocol_version( - self._opts, line_sender_protocol_version_3, &err): - raise c_err_to_py(err) + elif col.setup.source in ( + col_source_t.col_source_str_i8_cat, + col_source_t.col_source_str_i16_cat, + col_source_t.col_source_str_i32_cat): + if not _dataframe_columnar_has_utf8_dictionary( + &col.setup.chunks.chunks[0]): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 requires Arrow UTF-8 or LargeUtf8 dictionary ' + 'offsets and byte buffers for categorical columns.')) + elif col.setup.source not in ( + col_source_t.col_source_str_utf8_arrow, + col_source_t.col_source_str_lrg_utf8_arrow): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports string[pyarrow] columns backed by ' + 'Arrow UTF-8 or LargeUtf8, pandas string Categorical, ' + 'or object-dtype str.')) + elif not _dataframe_columnar_has_utf8_values( + &col.setup.chunks.chunks[0], + col.setup.source == + col_source_t.col_source_str_lrg_utf8_arrow): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 requires Arrow UTF-8 or LargeUtf8 offsets and byte buffers.')) + elif col.setup.target == col_target_t.col_target_symbol: + if col.setup.source 
in ( + col_source_t.col_source_str_i8_cat, + col_source_t.col_source_str_i16_cat, + col_source_t.col_source_str_i32_cat): + if not _dataframe_columnar_has_utf8_dictionary( + &col.setup.chunks.chunks[0]): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 requires Arrow UTF-8 or LargeUtf8 dictionary ' + 'offsets and byte buffers for categorical symbols.')) + elif col.setup.source in ( + col_source_t.col_source_str_utf8_arrow, + col_source_t.col_source_str_lrg_utf8_arrow): + if not _dataframe_columnar_has_utf8_values( + &col.setup.chunks.chunks[0], + col.setup.source == + col_source_t.col_source_str_lrg_utf8_arrow): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 requires Arrow UTF-8 or LargeUtf8 offsets and ' + 'byte buffers.')) else: - raise IngressError( - IngressErrorCode.ConfigError, - '"protocol_version" must be None, "auto", 1-3' + - f' not {protocol_version!r}') - - if auth_timeout is not None: - if isinstance(auth_timeout, int): - c_auth_timeout = auth_timeout - elif isinstance(auth_timeout, cp_timedelta): - c_auth_timeout = _timedelta_to_millis(auth_timeout) + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports pandas string Categorical or ' + 'string[pyarrow] symbol columns.')) + elif col.setup.target == col_target_t.col_target_at: + if col.setup.source not in ( + col_source_t.col_source_dt64ns_numpy, + col_source_t.col_source_dt64us_numpy, + col_source_t.col_source_dt64ns_tz_arrow, + col_source_t.col_source_dt64us_tz_arrow, + col_source_t.col_source_dt64ms_tz_arrow, + col_source_t.col_source_dt64s_tz_arrow): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 only supports NumPy datetime64[ns/us] or ' + 'tz-aware datetime64/timestamp[pyarrow] ' + 'designated timestamp columns.')) + elif (col.setup.source in ( + col_source_t.col_source_dt64ns_tz_arrow, + col_source_t.col_source_dt64us_tz_arrow, + col_source_t.col_source_dt64ms_tz_arrow, + 
col_source_t.col_source_dt64s_tz_arrow) + and col.setup.chunks.chunks[0].null_count != 0): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 designated timestamp columns cannot contain nulls.')) else: - raise TypeError( - '"auth_timeout" must be an int or a timedelta, ' - f'not {_fqn(type(auth_timeout))}') - if not line_sender_opts_auth_timeout(self._opts, c_auth_timeout, &err): - raise c_err_to_py(err) + ts_data = col.setup.chunks.chunks[0].buffers[1] + if _dataframe_columnar_i64_has_nat(ts_data, plan.row_count): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 designated timestamp columns cannot contain NaT.')) + elif _dataframe_columnar_i64_has_negative( + ts_data, plan.row_count): + failures.append(_dataframe_columnar_col_failure( + df, + col, + 'v1 designated timestamp columns cannot contain ' + 'timestamps before the Unix epoch.')) + elif col.setup.target in ( + col_target_t.col_target_column_i8, + col_target_t.col_target_column_i16, + col_target_t.col_target_column_i32, + col_target_t.col_target_column_f32, + col_target_t.col_target_column_uuid, + col_target_t.col_target_column_long256, + col_target_t.col_target_column_ipv4, + col_target_t.col_target_column_binary, + col_target_t.col_target_column_arrow): + # Column-QWP-only targets reached via `_FIELD_TARGETS_QWP`. + # Each currently reachable target's source-set in + # `_TARGET_TO_SOURCES` is a singleton, so the source is + # already constrained by routing. The contiguous-buffer + + # validity checks above cover layout; the per-type FFI + # handles the wire encoding. col_target_column_arrow delegates + # type validation to the Rust importer. 
+ pass + else: + failures.append(_dataframe_columnar_col_failure( + df, + col, + f'v1 does not support {_TARGET_NAMES[col.setup.target]} ' + 'columns.')) - if tls_verify is not None: - if (tls_verify is True) or (tls_verify == 'on'): - c_tls_verify = True - elif (tls_verify is False) or (tls_verify == 'unsafe_off'): - c_tls_verify = False - else: - raise ValueError( - '"tls_verify" must be a bool, "on" or "unsafe_off", ' - f'not {tls_verify!r}') - if not line_sender_opts_tls_verify(self._opts, c_tls_verify, &err): - raise c_err_to_py(err) + if field_count == 0: + failures.append(_dataframe_columnar_global_failure( + 'v1 requires at least one non-timestamp data column.')) - if tls_roots is not None: - tls_roots = str(tls_roots) - str_to_utf8(b, tls_roots, &c_tls_roots) - if not line_sender_opts_tls_roots(self._opts, c_tls_roots, &err): - raise c_err_to_py(err) + return failures - if tls_ca is not None: - c_tls_ca = TlsCa.parse(tls_ca).c_value - if not line_sender_opts_tls_ca(self._opts, c_tls_ca, &err): + +cdef void_int _dataframe_columnar_validate_plan( + object df, + dataframe_plan_t* plan) except -1: + cdef object failures = _dataframe_columnar_plan_failures(df, plan) + if failures: + raise UnsupportedDataFrameShapeError( + 'DataFrame is not supported by Client.dataframe() columnar v1.', + failures) + + +cdef bint _is_pyobj_source(col_source_t source) noexcept nogil: + return ( + source == col_source_t.col_source_str_pyobj or + source == col_source_t.col_source_int_pyobj or + source == col_source_t.col_source_float_pyobj or + source == col_source_t.col_source_bool_pyobj or + source == col_source_t.col_source_uuid_pyobj or + source == col_source_t.col_source_ipv4_pyobj or + source == col_source_t.col_source_datetime_pyobj or + source == col_source_t.col_source_bytes_pyobj) + + +cdef inline void _pyobj_set_validity_bit(uint8_t* bitmap, size_t row) noexcept nogil: + bitmap[row >> 3] |= (1 << (row & 7)) + + +cdef pyobj_built_t* _dataframe_columnar_build_str_pyobj( + 
col_t* col, + size_t row_count, + object df_col_name) except NULL: + """ + Walk a PyObject column once and produce Arrow-Utf8-shaped buffers + (int32 offsets + uint8 bytes + LSB-packed validity). Encoding uses + Python's str.encode('utf-8') so any valid Python str produces valid + UTF-8. + """ + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef size_t i + cdef Py_ssize_t utf8_len + cdef const char* utf8_buf + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t bytes_cap = 16 + cdef uint8_t* new_bytes + cdef size_t bytes_used = 0 + + try: + b.str_offsets = calloc(row_count + 1, sizeof(int32_t)) + if b.str_offsets == NULL: + raise MemoryError() + if validity_bytes > 0: + b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + b.str_bytes = malloc(bytes_cap) + if b.str_bytes == NULL: + raise MemoryError() + + for i in range(row_count): + cell = access[i] + if PyUnicode_CheckExact(cell): + utf8_buf = PyUnicode_AsUTF8AndSize(cell, &utf8_len) + if bytes_used + utf8_len > 2_147_483_647: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r}: column total UTF-8 ' + 'bytes exceeds the QWP wire varchar offset table ' + 'limit (2 GiB).') + while bytes_used + utf8_len > bytes_cap: + bytes_cap *= 2 + new_bytes = realloc(b.str_bytes, bytes_cap) + if new_bytes == NULL: + raise MemoryError() + b.str_bytes = new_bytes + if utf8_len > 0: + memcpy(b.str_bytes + bytes_used, utf8_buf, utf8_len) + bytes_used += utf8_len + b.str_offsets[i + 1] = bytes_used + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.str_offsets[i + 1] = bytes_used + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected str, ' + 
f'got {_fqn(type(cell))}.') + + b.str_bytes_len = bytes_used + + # If the column turned out to be all-valid, drop the bitmap so + # the FFI takes the no-validity hot path. + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_int_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + """ + Walk a PyObject int column once and produce a contiguous int64 + buffer + LSB-packed validity bitmap. Null cells leave the int64 + slot at 0 with the validity bit cleared. + + Null detection: ``None``, ``pd.NA``, and ``float('nan')`` all count + as null — the NaN-as-null rule matches the row-path behaviour + (`_dataframe_is_null_pyobj` in dataframe.pxi). A non-NaN float in + an int-sniffed column raises ``IngressError`` with the row index; + we accept the asymmetry because column-wide sniff has already + locked the source type from the first non-null cell. + """ + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef int64_t* values = NULL + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t i + cdef int64_t value + + try: + values = calloc(row_count if row_count > 0 else 1, + sizeof(int64_t)) + if values == NULL: + raise MemoryError() + b.data = values + if validity_bytes > 0: + b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + for i in range(row_count): + cell = access[i] + # PyBool_Check goes BEFORE PyLong_CheckExact because Python + # bools are subclasses of int and PyLong_CheckExact returns + # false for them; treat them as int (matches row-path). 
+ if PyBool_Check(cell): + values[i] = 1 if cell == True else 0 + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif PyLong_CheckExact(cell): + value = PyLong_AsLongLong(cell) + values[i] = value + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected int, ' + f'got {_fqn(type(cell))}.') + + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_float_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + """ + Walk a PyObject float column once and produce a contiguous double + buffer + LSB-packed validity bitmap. + """ + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef double* values = NULL + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t i + cdef double value + + try: + values = calloc(row_count if row_count > 0 else 1, + sizeof(double)) + if values == NULL: + raise MemoryError() + b.data = values + if validity_bytes > 0: + b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + for i in range(row_count): + cell = access[i] + if PyFloat_CheckExact(cell): + value = PyFloat_AS_DOUBLE(cell) + if isnan(value): + # pandas NaN-as-null convention matches the row-path. + b.has_nulls = True + else: + values[i] = value + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif PyLong_CheckExact(cell) or PyBool_Check(cell): + # Accept widening of int / bool to float, matching how + # Python implicitly converts when you do float(x). 
+ values[i] = PyFloat_AsDouble(cell) + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected float, ' + f'got {_fqn(type(cell))}.') + + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_bool_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + """ + Walk a PyObject bool column once and pack the values into an + Arrow LSB-first bitmap (one bit per row). Null cells are rejected — + matches the row-path behaviour (QuestDB BOOLEAN has no null + representation at the row level). + """ + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef uint8_t* bits = NULL + cdef size_t bytes = (row_count + 7) // 8 + cdef size_t i + + try: + if bytes == 0: + bytes = 1 + bits = calloc(bytes, sizeof(uint8_t)) + if bits == NULL: + raise MemoryError() + b.data = bits + for i in range(row_count): + cell = access[i] + if PyBool_Check(cell): + if cell == True: + bits[i >> 3] |= (1 << (i & 7)) + elif _dataframe_is_null_pyobj(cell): + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: cannot insert ' + 'null into a boolean column.') + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected bool, ' + f'got {_fqn(type(cell))}.') + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_uuid_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise 
MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef uint8_t* buf = NULL + cdef size_t buf_bytes = row_count * 16 if row_count > 0 else 16 + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t i + cdef object le_bytes + cdef object uuid_cls = _uuid.UUID + + try: + buf = calloc(buf_bytes, sizeof(uint8_t)) + if buf == NULL: + raise MemoryError() + b.data = buf + if validity_bytes > 0: + b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + for i in range(row_count): + cell = access[i] + if isinstance(cell, uuid_cls): + # `.int.to_bytes(16, 'little')` produces exactly the + # QuestDB UUID wire layout: bytes 0..8 = lo half LE, + # bytes 8..16 = hi half LE. One C-implemented call + + # one 16-byte memcpy per row. + le_bytes = (cell).int.to_bytes(16, 'little') + memcpy(buf + i * 16, PyBytes_AsString(le_bytes), 16) + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected UUID, ' + f'got {_fqn(type(cell))}.') + + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_ipv4_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef uint32_t* values = NULL + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t i + cdef object ipv4_cls = _ipaddress.IPv4Address + + try: + values = calloc(row_count if row_count > 0 else 1, + sizeof(uint32_t)) + if values == NULL: + raise MemoryError() + 
b.data = values + if validity_bytes > 0: + b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + for i in range(row_count): + cell = access[i] + if isinstance(cell, ipv4_cls): + values[i] = int(cell) + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected ' + f'ipaddress.IPv4Address, got {_fqn(type(cell))}.') + + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef inline int64_t _days_from_civil(int y, int m, int d) noexcept nogil: + cdef int y_adj = y - 1 if m <= 2 else y + cdef int era = (y_adj if y_adj >= 0 else y_adj - 399) // 400 + cdef int yoe = y_adj - era * 400 + cdef int m_adj = m - 3 if m > 2 else m + 9 + cdef int doy = (153 * m_adj + 2) // 5 + d - 1 + cdef int doe = yoe * 365 + yoe // 4 - yoe // 100 + doy + return era * 146097 + doe - 719468 + + +cdef pyobj_built_t* _dataframe_columnar_build_datetime_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef int64_t* values = NULL + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t i + cdef object dt + cdef object epoch_aware = datetime.datetime( + 1970, 1, 1, tzinfo=datetime.timezone.utc) + cdef object datetime_cls = datetime.datetime + cdef object delta + cdef int year, month, day, hour, minute, second, us + cdef int64_t days + + try: + values = calloc(row_count if row_count > 0 else 1, + sizeof(int64_t)) + if values == NULL: + raise MemoryError() + b.data = values + if validity_bytes > 0: + b.validity = calloc(validity_bytes, 
sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + for i in range(row_count): + cell = access[i] + if isinstance(cell, datetime_cls): + dt = cell + if dt.tzinfo is None: + # Fast path: C-level field extraction + Howard + # Hinnant days_from_civil; no Python timedelta / + # int arithmetic per row. + year = PyDateTime_GET_YEAR(dt) + month = PyDateTime_GET_MONTH(dt) + day = PyDateTime_GET_DAY(dt) + hour = PyDateTime_DATE_GET_HOUR(dt) + minute = PyDateTime_DATE_GET_MINUTE(dt) + second = PyDateTime_DATE_GET_SECOND(dt) + us = PyDateTime_DATE_GET_MICROSECOND(dt) + days = _days_from_civil(year, month, day) + values[i] = ( + days * 86_400_000_000 + + hour * 3_600_000_000 + + minute * 60_000_000 + + second * 1_000_000 + + us) + else: + delta = dt - epoch_aware + values[i] = ( + delta.days * 86_400_000_000 + + delta.seconds * 1_000_000 + + delta.microseconds) + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected ' + f'datetime.datetime, got {_fqn(type(cell))}.') + + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef pyobj_built_t* _dataframe_columnar_build_bytes_pyobj( + col_t* col, + size_t row_count, + object df_col_name) except NULL: + cdef pyobj_built_t* b = calloc(1, sizeof(pyobj_built_t)) + if b == NULL: + raise MemoryError() + b.row_count = row_count + + cdef PyObject** access = col.setup.chunks.chunks[0].buffers[1] + cdef PyObject* cell + cdef Py_ssize_t blob_len + cdef const char* blob_buf + cdef size_t validity_bytes = (row_count + 7) // 8 + cdef size_t bytes_cap = 16 + cdef uint8_t* new_bytes + cdef size_t bytes_used = 0 + cdef size_t i + + try: + b.str_offsets = calloc(row_count + 1, sizeof(int32_t)) + if b.str_offsets == NULL: + raise MemoryError() + if validity_bytes > 0: 
+ b.validity = calloc(validity_bytes, sizeof(uint8_t)) + if b.validity == NULL: + raise MemoryError() + b.str_bytes = malloc(bytes_cap) + if b.str_bytes == NULL: + raise MemoryError() + + for i in range(row_count): + cell = access[i] + if PyBytes_CheckExact(cell): + blob_len = PyBytes_GET_SIZE(cell) + blob_buf = PyBytes_AsString(cell) + if bytes_used + blob_len > 2_147_483_647: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r}: column total bytes ' + 'exceeds the QWP wire binary offset table ' + 'limit (2 GiB).') + while bytes_used + blob_len > bytes_cap: + bytes_cap *= 2 + new_bytes = realloc(b.str_bytes, bytes_cap) + if new_bytes == NULL: + raise MemoryError() + b.str_bytes = new_bytes + if blob_len > 0: + memcpy(b.str_bytes + bytes_used, blob_buf, blob_len) + bytes_used += blob_len + b.str_offsets[i + 1] = bytes_used + if b.validity != NULL: + _pyobj_set_validity_bit(b.validity, i) + elif _dataframe_is_null_pyobj(cell): + b.str_offsets[i + 1] = bytes_used + b.has_nulls = True + else: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad column {df_col_name!r} at row {i}: expected bytes, ' + f'got {_fqn(type(cell))}.') + + b.str_bytes_len = bytes_used + if not b.has_nulls and b.validity != NULL: + free(b.validity) + b.validity = NULL + except: + pyobj_built_free(b) + raise + + return b + + +cdef void_int _dataframe_columnar_prebuild_pyobj( + object df, + dataframe_plan_t* plan) except -1: + """ + Walk every PyObject-sourced column once and stash typed buffers on + `plan.pyobj_built`. Runs after `validate_plan` and before the chunk + emission loop in `Client.dataframe()`. 
+ """ + cdef size_t i + cdef col_t* col + cdef bint any_pyobj = False + + for i in range(plan.col_count): + col = &plan.cols.d[i] + if _is_pyobj_source(col.setup.source): + any_pyobj = True + break + if not any_pyobj: + return 0 + + plan.pyobj_built = calloc( + plan.col_count, sizeof(pyobj_built_t*)) + if plan.pyobj_built == NULL: + raise MemoryError() + + for i in range(plan.col_count): + col = &plan.cols.d[i] + if col.setup.source == col_source_t.col_source_str_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_str_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_int_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_int_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_float_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_float_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_bool_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_bool_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_uuid_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_uuid_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_ipv4_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_ipv4_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_datetime_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_datetime_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + elif col.setup.source == col_source_t.col_source_bytes_pyobj: + plan.pyobj_built[i] = _dataframe_columnar_build_bytes_pyobj( + col, plan.row_count, df.columns[col.setup.orig_index]) + + +cdef void_int _dataframe_columnar_append_pyobj_str( + column_sender_chunk* chunk, + col_t* col, + pyobj_built_t* 
prebuilt, + size_t row_offset, + size_t row_count) except -1: + cdef line_sender_error* err = NULL + cdef column_sender_validity validity + cdef const column_sender_validity* validity_ptr = NULL + cdef bint ok = False + cdef int32_t* offsets + cdef size_t bytes_len + + if prebuilt == NULL: + raise RuntimeError( + 'PyObject str column missing pre-built buffer; ' + 'prebuild phase did not run.') + if prebuilt.has_nulls: + if row_offset % 8 != 0: + raise RuntimeError( + 'PyObject str column with nulls requires byte-aligned ' + 'chunk boundaries.') + validity.bits = prebuilt.validity + (row_offset // 8) + validity.bit_len = row_count + validity_ptr = &validity + offsets = prebuilt.str_offsets + row_offset + bytes_len = prebuilt.str_bytes_len + with nogil: + ok = column_sender_chunk_column_varchar( + chunk, + col.name.buf, + col.name.len, + offsets, + prebuilt.str_bytes, + bytes_len, + row_count, + validity_ptr, + &err) + if not ok: + raise c_err_to_py(err) + + +cdef void_int _dataframe_columnar_append_pyobj_simple( + column_sender_chunk* chunk, + col_t* col, + pyobj_built_t* prebuilt, + size_t row_offset, + size_t row_count, + size_t elem_size, + column_sender_numpy_dtype dtype) except -1: + cdef line_sender_error* err = NULL + cdef column_sender_validity validity + cdef const column_sender_validity* validity_ptr = NULL + cdef bint ok = False + + if prebuilt == NULL: + raise RuntimeError('PyObject column missing pre-built buffer.') + if prebuilt.has_nulls: + if row_offset % 8 != 0: + raise RuntimeError( + 'PyObject column with nulls requires byte-aligned ' + 'chunk boundaries.') + validity.bits = prebuilt.validity + (row_offset // 8) + validity.bit_len = row_count + validity_ptr = &validity + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + dtype, + (prebuilt.data) + row_offset * elem_size, + row_count * elem_size, + row_count, + validity_ptr, + NULL, + &err) + if not ok: + raise c_err_to_py(err) + + +cdef void_int 
_dataframe_columnar_append_pyobj_bytes( + column_sender_chunk* chunk, + col_t* col, + pyobj_built_t* prebuilt, + size_t row_offset, + size_t row_count) except -1: + cdef line_sender_error* err = NULL + cdef column_sender_validity validity + cdef const column_sender_validity* validity_ptr = NULL + cdef bint ok = False + + if prebuilt == NULL: + raise RuntimeError('PyObject bytes column missing pre-built buffer.') + if prebuilt.has_nulls: + if row_offset % 8 != 0: + raise RuntimeError( + 'PyObject bytes column with nulls requires byte-aligned ' + 'chunk boundaries.') + validity.bits = prebuilt.validity + (row_offset // 8) + validity.bit_len = row_count + validity_ptr = &validity + with nogil: + ok = column_sender_chunk_column_binary( + chunk, + col.name.buf, + col.name.len, + prebuilt.str_offsets + row_offset, + prebuilt.str_bytes, + prebuilt.str_bytes_len, + row_count, + validity_ptr, + &err) + if not ok: + raise c_err_to_py(err) + + +cdef void_int _dataframe_columnar_call_arrow_append( + column_sender_chunk* chunk, + col_t* col, + size_t row_offset, + size_t row_count, + column_sender_symbol_mode symbol_mode + =column_sender_symbol_mode_auto) except -1: + cdef line_sender_error* err = NULL + cdef bint ok = False + cdef column_sender_arrow_import* imported = col.setup.arrow_import + with nogil: + if imported == NULL: + imported = column_sender_arrow_import_new( + &col.setup.chunks.chunks[0], + &col.setup.arrow_schema, + symbol_mode, + &err) + if imported != NULL: + ok = column_sender_chunk_append_arrow_import( + chunk, + col.name.buf, + col.name.len, + imported, + row_offset, + row_count, + &err) + col.setup.arrow_import = imported + if not ok: + raise c_err_to_py(err) + return 0 + + +cdef void_int _dataframe_columnar_append_field( + column_sender_chunk* chunk, + col_t* col, + pyobj_built_t* prebuilt, + size_t row_offset, + size_t row_count) except -1: + cdef line_sender_error* err = NULL + cdef ArrowArray* arr = &col.setup.chunks.chunks[0] + cdef ArrowArray* 
dictionary + cdef const void* data = arr.buffers[1] + cdef int32_t* offsets + cdef int32_t* dict_offsets + cdef size_t bytes_len + cdef size_t dict_offsets_len + cdef size_t dict_bytes_len + cdef column_sender_validity validity + cdef const column_sender_validity* validity_ptr = ( + _dataframe_columnar_validity(arr, row_offset, row_count, &validity)) + cdef bint ok = False + + cdef column_sender_numpy_dtype numpy_dtype + cdef size_t element_size + cdef column_sender_numpy_extras extras + cdef const column_sender_numpy_extras* extras_ptr + + if col.setup.target == col_target_t.col_target_column_bool: + if col.setup.source == col_source_t.col_source_bool_pyobj: + if prebuilt == NULL: + raise RuntimeError( + 'PyObject bool column missing pre-built bitmap.') + if row_offset % 8 != 0: + raise RuntimeError( + 'PyObject bool column requires byte-aligned chunk boundaries.') + with nogil: + ok = column_sender_chunk_column_bool( + chunk, + col.name.buf, + col.name.len, + (prebuilt.data) + (row_offset // 8), + row_count, + NULL, + &err) + elif col.setup.source == col_source_t.col_source_bool_numpy: + # NumPy bool is byte-per-row; Rust packs to LSB-bitmap + # inside column_sender_chunk_append_numpy_column. 
+ with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + column_sender_numpy_dtype.column_sender_numpy_bool, + (data) + row_offset, + row_count, + row_count, + validity_ptr, + NULL, + &err) + else: + raise RuntimeError('Unsupported columnar bool source.') + elif col.setup.target == col_target_t.col_target_column_i64: + if col.setup.source == col_source_t.col_source_int_pyobj: + if prebuilt == NULL: + raise RuntimeError( + 'PyObject int column missing pre-built buffer.') + if prebuilt.has_nulls and row_offset % 8 != 0: + raise RuntimeError( + 'PyObject int column with nulls requires byte-aligned ' + 'chunk boundaries.') + if prebuilt.has_nulls: + validity.bits = prebuilt.validity + (row_offset // 8) + validity.bit_len = row_count + validity_ptr = &validity + else: + validity_ptr = NULL + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + column_sender_numpy_dtype.column_sender_numpy_i64, + (prebuilt.data) + row_offset * 8, + row_count * 8, + row_count, + validity_ptr, + NULL, + &err) + else: + # Rust widens narrow ints to a sentinel-safe wire (i8/i16 → INT, + # i32/u32/u64 → LONG); see questdb-rs NumpyDtype::*WidenTo*. 
+ if col.setup.source in ( + col_source_t.col_source_i64_numpy, + col_source_t.col_source_i64_arrow): + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_i64 + element_size = 8 + elif col.setup.source == col_source_t.col_source_i8_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_i8 + element_size = 1 + elif col.setup.source == col_source_t.col_source_i16_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_i16 + element_size = 2 + elif col.setup.source == col_source_t.col_source_i32_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_i32 + element_size = 4 + elif col.setup.source == col_source_t.col_source_u8_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_u8 + element_size = 1 + elif col.setup.source == col_source_t.col_source_u16_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_u16 + element_size = 2 + elif col.setup.source in ( + col_source_t.col_source_u32_numpy, + col_source_t.col_source_u32_arrow): + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_u32 + element_size = 4 + elif col.setup.source == col_source_t.col_source_u64_numpy: + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_u64 + element_size = 8 + else: + raise RuntimeError('Unsupported columnar int source.') + extras_ptr = NULL + if col.setup.has_override: + numpy_dtype = col.setup.override_dtype + if (numpy_dtype + == column_sender_numpy_dtype.column_sender_numpy_geohash_i8 + or numpy_dtype + == column_sender_numpy_dtype.column_sender_numpy_geohash_i16 + or numpy_dtype + == column_sender_numpy_dtype.column_sender_numpy_geohash_i32 + or numpy_dtype + == column_sender_numpy_dtype.column_sender_numpy_geohash_i64): + memset(&extras, 0, sizeof(column_sender_numpy_extras)) + extras.geohash_bits = col.setup.override_geohash_bits + extras_ptr = &extras + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + numpy_dtype, + (data) + 
row_offset * element_size, + row_count * element_size, + row_count, + validity_ptr, + extras_ptr, + &err) + elif col.setup.target == col_target_t.col_target_column_f64: + if col.setup.source in ( + col_source_t.col_source_f64_numpy, + col_source_t.col_source_f64_arrow): + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_f64 + element_size = 8 + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + numpy_dtype, + (data) + row_offset * element_size, + row_count * element_size, + row_count, + validity_ptr, + NULL, + &err) + elif col.setup.source == col_source_t.col_source_f32_numpy: + # Rust emits FLOAT wire for numpy f32; server widens to DOUBLE + # column if needed. + numpy_dtype = column_sender_numpy_dtype.column_sender_numpy_f32 + element_size = 4 + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + numpy_dtype, + (data) + row_offset * element_size, + row_count * element_size, + row_count, + validity_ptr, + NULL, + &err) + elif col.setup.source == col_source_t.col_source_float_pyobj: + if prebuilt == NULL: + raise RuntimeError( + 'PyObject float column missing pre-built buffer.') + if prebuilt.has_nulls and row_offset % 8 != 0: + raise RuntimeError( + 'PyObject float column with nulls requires byte-aligned ' + 'chunk boundaries.') + if prebuilt.has_nulls: + validity.bits = prebuilt.validity + (row_offset // 8) + validity.bit_len = row_count + validity_ptr = &validity + else: + validity_ptr = NULL + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + column_sender_numpy_dtype.column_sender_numpy_f64, + (prebuilt.data) + row_offset * 8, + row_count * 8, + row_count, + validity_ptr, + NULL, + &err) + else: + raise RuntimeError('Unsupported columnar float source.') + elif col.setup.target == col_target_t.col_target_column_ts: + if col.setup.source == col_source_t.col_source_dt64ns_numpy: + with nogil: + ok = 
column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + column_sender_numpy_dtype.column_sender_numpy_datetime64_ns, + (data) + row_offset * 8, + row_count * 8, + row_count, + validity_ptr, + NULL, + &err) + elif col.setup.source == col_source_t.col_source_dt64us_numpy: + with nogil: + ok = column_sender_chunk_append_numpy_column( + chunk, + col.name.buf, + col.name.len, + column_sender_numpy_dtype.column_sender_numpy_datetime64_us, + (data) + row_offset * 8, + row_count * 8, + row_count, + validity_ptr, + NULL, + &err) + elif col.setup.source in ( + col_source_t.col_source_dt64ns_tz_arrow, + col_source_t.col_source_dt64us_tz_arrow): + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + elif col.setup.source == col_source_t.col_source_datetime_pyobj: + _dataframe_columnar_append_pyobj_simple( + chunk, col, prebuilt, row_offset, row_count, 8, + column_sender_numpy_dtype.column_sender_numpy_datetime64_us) + return 0 + else: + raise RuntimeError('Unsupported columnar timestamp field source.') + elif col.setup.target in ( + col_target_t.col_target_column_i8, + col_target_t.col_target_column_i16, + col_target_t.col_target_column_i32, + col_target_t.col_target_column_f32, + col_target_t.col_target_column_long256): + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + elif col.setup.target == col_target_t.col_target_column_uuid: + if col.setup.source == col_source_t.col_source_uuid_pyobj: + _dataframe_columnar_append_pyobj_simple( + chunk, col, prebuilt, row_offset, row_count, 16, + column_sender_numpy_dtype.column_sender_numpy_s16) + return 0 + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + elif col.setup.target == col_target_t.col_target_column_ipv4: + if col.setup.source == col_source_t.col_source_ipv4_pyobj: + _dataframe_columnar_append_pyobj_simple( + chunk, col, prebuilt, row_offset, row_count, 4, + 
column_sender_numpy_dtype.column_sender_numpy_u32_ipv4) + return 0 + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + elif col.setup.target == col_target_t.col_target_column_binary: + if col.setup.source == col_source_t.col_source_bytes_pyobj: + _dataframe_columnar_append_pyobj_bytes( + chunk, col, prebuilt, row_offset, row_count) + return 0 + raise RuntimeError('Unsupported columnar binary field source.') + elif col.setup.target == col_target_t.col_target_column_str: + if col.setup.source == col_source_t.col_source_str_pyobj: + _dataframe_columnar_append_pyobj_str( + chunk, col, prebuilt, row_offset, row_count) + return 0 # err already raised inside on failure + if col.setup.source in ( + col_source_t.col_source_str_i8_cat, + col_source_t.col_source_str_i16_cat, + col_source_t.col_source_str_i32_cat): + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count, + column_sender_symbol_mode_not_symbol) + return 0 + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + elif col.setup.target == col_target_t.col_target_symbol: + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count, + column_sender_symbol_mode_symbol) + return 0 + elif col.setup.target == col_target_t.col_target_column_arrow: + _dataframe_columnar_call_arrow_append( + chunk, col, row_offset, row_count) + return 0 + else: + raise RuntimeError('Unsupported columnar field target.') + + if not ok: + raise c_err_to_py(err) + + +cdef void_int _dataframe_columnar_append_at( + column_sender_chunk* chunk, + col_t* col, + pyobj_built_t* prebuilt, + size_t row_offset, + size_t row_count) except -1: + cdef line_sender_error* err = NULL + cdef const int64_t* data + cdef bint ok = False + + if col.setup.source == col_source_t.col_source_datetime_pyobj: + if prebuilt == NULL: + raise RuntimeError( + 'PyObject datetime designated TS missing pre-built buffer.') + if prebuilt.has_nulls: + raise IngressError( 
+ IngressErrorCode.BadDataFrame, + 'Designated timestamp column cannot contain nulls.') + data = prebuilt.data + with nogil: + ok = column_sender_chunk_designated_timestamp_micros( + chunk, + data + row_offset, + row_count, + &err) + if not ok: + raise c_err_to_py(err) + return 0 + + data = <const int64_t*>(col.setup.chunks.chunks[0].buffers[1]) + + if col.setup.source in ( + col_source_t.col_source_dt64ns_numpy, + col_source_t.col_source_dt64ns_tz_arrow): + with nogil: + ok = column_sender_chunk_designated_timestamp_nanos( + chunk, + data + row_offset, + row_count, + &err) + elif col.setup.source in ( + col_source_t.col_source_dt64us_numpy, + col_source_t.col_source_dt64us_tz_arrow): + with nogil: + ok = column_sender_chunk_designated_timestamp_micros( + chunk, + data + row_offset, + row_count, + &err) + elif col.setup.source == col_source_t.col_source_dt64ms_tz_arrow: + with nogil: + ok = column_sender_chunk_designated_timestamp_millis( + chunk, + data + row_offset, + row_count, + &err) + elif col.setup.source == col_source_t.col_source_dt64s_tz_arrow: + with nogil: + ok = column_sender_chunk_designated_timestamp_seconds( + chunk, + data + row_offset, + row_count, + &err) + else: + raise RuntimeError('Unsupported columnar designated timestamp source.') + + if not ok: + raise c_err_to_py(err) + + +cdef int _geohash_override_dtype(col_source_t source) noexcept: + if source == col_source_t.col_source_i8_numpy: + return column_sender_numpy_dtype.column_sender_numpy_geohash_i8 + if source == col_source_t.col_source_i16_numpy: + return column_sender_numpy_dtype.column_sender_numpy_geohash_i16 + if source == col_source_t.col_source_i32_numpy: + return column_sender_numpy_dtype.column_sender_numpy_geohash_i32 + if source == col_source_t.col_source_i64_numpy: + return column_sender_numpy_dtype.column_sender_numpy_geohash_i64 + return -1 + + +cdef object _dataframe_normalize_nullable(object df): + if not _is_pandas_dataframe_object(df): + return df + _dataframe_may_import_deps() + cdef 
object masked_base = _pandas_masked_dtype() + convert = [] + for name, dtype in zip(df.columns, df.dtypes): + # pyarrow-backed strings keep their Arrow buffers (resolved as + # str_utf8_arrow), so an all-null column survives as a null VARCHAR + # instead of collapsing to a skipped all-null object column. + if isinstance(dtype, masked_base): + convert.append(name) + elif (isinstance(dtype, _PANDAS.StringDtype) + and getattr(dtype, 'storage', None) != 'pyarrow'): + convert.append(name) + if not convert: + return df + out = df.copy(deep=False) + for name in convert: + out[name] = df[name].astype(object) + out.attrs = dict(df.attrs) + return out + + +cdef object _dataframe_normalize_at_timestamp(object df, object at): + # tz-aware (DatetimeTZ) ms/s designated-`at` columns can't reach the + # columnar resolver's source override (the shared classifier rejects + # non-ns/us tz units first), so widen them to us here. ArrowDtype ms/s + # is widened to micros in Rust by the millis/seconds designated-ts FFI. 
+ cdef object dtype, new_dtype, out + if not isinstance(at, str) or not _is_pandas_dataframe_object(df): + return df + _dataframe_may_import_deps() + try: + if at not in df.columns: + return df + dtype = df[at].dtype + except Exception: + return df + if not isinstance(dtype, _PANDAS.DatetimeTZDtype) or dtype.unit not in ('s', 'ms'): + return df + new_dtype = _PANDAS.DatetimeTZDtype('us', dtype.tz) + out = df.copy(deep=False) + out[at] = df[at].astype(new_dtype) + out.attrs = dict(df.attrs) + return out + + +cdef void_int _dataframe_apply_roundtrip_overrides( + object df, dataframe_plan_t* plan) except -1: + cdef size_t col_index + cdef col_t* col + cdef int gh + for col_index in range(plan.col_count): + plan.cols.d[col_index].setup.has_override = False + attrs = getattr(df, 'attrs', None) + if not attrs: + return 0 + qmeta = attrs.get('questdb') + if not qmeta: + return 0 + cols_meta = qmeta.get('columns') + if not cols_meta: + return 0 + df_cols = list(df.columns) + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + if col.setup.orig_index >= len(df_cols): + continue + meta = cols_meta.get(df_cols[col.setup.orig_index]) + if not meta: + continue + kind = meta.get('kind') + if (kind == 'ipv4' + and col.setup.source == col_source_t.col_source_u32_numpy): + col.setup.has_override = True + col.setup.override_dtype = \ + column_sender_numpy_dtype.column_sender_numpy_u32_ipv4 + elif (kind == 'char' + and col.setup.source == col_source_t.col_source_u16_numpy): + col.setup.has_override = True + col.setup.override_dtype = \ + column_sender_numpy_dtype.column_sender_numpy_u16_char + elif kind == 'geohash': + gh = _geohash_override_dtype(col.setup.source) + bits = meta.get('precision_bits') or 0 + if gh != -1 and 1 <= bits <= 60: + col.setup.has_override = True + col.setup.override_dtype = gh + col.setup.override_geohash_bits = bits + return 0 + + +cdef void_int _dataframe_columnar_populate_chunk( + dataframe_plan_t* plan, + column_sender_chunk* chunk, 
+ size_t row_offset, + size_t row_count) except -1: + cdef size_t col_index + cdef col_t* col + cdef col_t* at_col = NULL + cdef size_t at_col_index = 0 + cdef size_t field_count = 0 + cdef pyobj_built_t* prebuilt = NULL + cdef pyobj_built_t* at_prebuilt = NULL + + for col_index in range(plan.col_count): + col = &plan.cols.d[col_index] + if col.setup.target == col_target_t.col_target_at: + at_col = col + at_col_index = col_index + elif col.setup.target in ( + col_target_t.col_target_column_bool, + col_target_t.col_target_column_i64, + col_target_t.col_target_column_f64, + col_target_t.col_target_column_ts, + col_target_t.col_target_column_str, + col_target_t.col_target_symbol, + col_target_t.col_target_column_i8, + col_target_t.col_target_column_i16, + col_target_t.col_target_column_i32, + col_target_t.col_target_column_f32, + col_target_t.col_target_column_uuid, + col_target_t.col_target_column_long256, + col_target_t.col_target_column_ipv4, + col_target_t.col_target_column_binary, + col_target_t.col_target_column_arrow): + if plan.pyobj_built != NULL: + prebuilt = plan.pyobj_built[col_index] + else: + prebuilt = NULL + _dataframe_columnar_append_field( + chunk, col, prebuilt, row_offset, row_count) + field_count += 1 + + if field_count == 0: + raise RuntimeError( + 'Validated columnar plan has no non-timestamp data columns.') + if at_col == NULL: + raise RuntimeError('Validated columnar plan has no timestamp column.') + if plan.pyobj_built != NULL: + at_prebuilt = plan.pyobj_built[at_col_index] + _dataframe_columnar_append_at( + chunk, at_col, at_prebuilt, row_offset, row_count) + + +cdef void_int _dataframe_columnar_sync(column_sender* conn) except -1: + cdef line_sender_error* err = NULL + cdef bint ok = False + cdef PyThreadState* gs = NULL + cdef uint64_t start_ns = 0 + global _dataframe_columnar_sync_calls + global _dataframe_columnar_sync_ns + if _dataframe_columnar_count_io_stats: + start_ns = time.perf_counter_ns() + _ensure_doesnt_have_gil(&gs) + ok = 
column_sender_sync( + conn, + column_sender_ack_level.column_sender_ack_level_ok, + &err) + _ensure_has_gil(&gs) + if _dataframe_columnar_count_io_stats: + _dataframe_columnar_sync_calls += 1 + _dataframe_columnar_sync_ns += time.perf_counter_ns() - start_ns + if not ok: + raise c_err_to_py(err) + + +cdef bint _dataframe_columnar_force_drop_after_error( + column_sender* conn, + bint flushed, + bint flush_attempted, + bint sync_attempted) noexcept: + # Exceptions during a dataframe publish can leave in-flight deferred + # frames on the connection. If rows were flushed and the closing sync was + # not attempted yet, one defensive sync can make the connection reusable. + # Otherwise the connection only needs dropping when the sender latched it + # terminal: a validation/capacity failure writes no bytes and leaves the + # pooled connection reusable. + if conn == NULL: + return False + if not flush_attempted: + return column_sender_must_close(conn) + if flushed and not sync_attempted and not column_sender_must_close(conn): + try: + _dataframe_columnar_sync(conn) + return False + except Exception: + pass + return column_sender_must_close(conn) + + +cdef bint _dataframe_columnar_is_deferred_capacity_error( + line_sender_error* err) noexcept: + cdef size_t msg_len = 0 + cdef const char* msg = line_sender_error_msg(err, &msg_len) + if msg_len < 47: + return False + return strncmp( + msg, + "column sender deferred flush capacity exhausted", + 47) == 0 + + +cdef void_int _dataframe_columnar_flush( + column_sender* conn, + column_sender_chunk* chunk, + bint retry_after_sync) except -1: + cdef line_sender_error* err = NULL + cdef line_sender_error_code err_code + cdef bint ok = False + cdef PyThreadState* gs = NULL + cdef uint64_t start_ns = 0 + global _dataframe_columnar_flush_calls + global _dataframe_columnar_flush_ns + global _dataframe_columnar_flush_retry_syncs + + if _dataframe_columnar_count_io_stats: + start_ns = time.perf_counter_ns() + _ensure_doesnt_have_gil(&gs) + 
ok = column_sender_flush(conn, chunk, &err) + _ensure_has_gil(&gs) + if _dataframe_columnar_count_io_stats: + _dataframe_columnar_flush_calls += 1 + _dataframe_columnar_flush_ns += time.perf_counter_ns() - start_ns + if ok: + return 0 + + err_code = line_sender_error_get_code(err) + if (retry_after_sync and err_code == line_sender_error_invalid_api_call and + _dataframe_columnar_is_deferred_capacity_error(err)): + if _dataframe_columnar_count_io_stats: + _dataframe_columnar_flush_retry_syncs += 1 + line_sender_error_free(err) + err = NULL + _dataframe_columnar_sync(conn) + if _dataframe_columnar_count_io_stats: + start_ns = time.perf_counter_ns() + _ensure_doesnt_have_gil(&gs) + ok = column_sender_flush(conn, chunk, &err) + _ensure_has_gil(&gs) + if _dataframe_columnar_count_io_stats: + _dataframe_columnar_flush_calls += 1 + _dataframe_columnar_flush_ns += time.perf_counter_ns() - start_ns + if ok: + return 0 + + raise c_err_to_py(err) + + +cdef void_int _dataframe_arrow_flush_batch( + column_sender* conn, + line_sender_table_name table, + ArrowArray* array, + ArrowSchema* schema, + line_sender_column_name* ts_column, + const column_sender_arrow_override* overrides, + size_t overrides_len) except -1: + cdef line_sender_error* err = NULL + cdef bint ok = False + cdef PyThreadState* gs = NULL + cdef uint64_t start_ns = 0 + global _dataframe_columnar_flush_calls + global _dataframe_columnar_flush_ns + + if _dataframe_columnar_count_io_stats: + start_ns = time.perf_counter_ns() + _ensure_doesnt_have_gil(&gs) + if ts_column != NULL: + ok = column_sender_flush_arrow_batch_at_column( + conn, table, array, schema, ts_column[0], + overrides, overrides_len, &err) + else: + ok = column_sender_flush_arrow_batch_server_stamped( + conn, table, array, schema, + overrides, overrides_len, &err) + _ensure_has_gil(&gs) + if _dataframe_columnar_count_io_stats: + _dataframe_columnar_flush_calls += 1 + _dataframe_columnar_flush_ns += time.perf_counter_ns() - start_ns + if not ok: + 
raise c_err_to_py(err) + return 0 + + +def _debug_dataframe_columnar_io_stats( + object enabled=None, + bint reset=False): + """ + Internal benchmark hook for columnar flush/sync timing. + """ + global _dataframe_columnar_count_io_stats + global _dataframe_columnar_flush_calls + global _dataframe_columnar_flush_ns + global _dataframe_columnar_sync_calls + global _dataframe_columnar_sync_ns + global _dataframe_columnar_flush_retry_syncs + + if reset: + _dataframe_columnar_flush_calls = 0 + _dataframe_columnar_flush_ns = 0 + _dataframe_columnar_sync_calls = 0 + _dataframe_columnar_sync_ns = 0 + _dataframe_columnar_flush_retry_syncs = 0 + if enabled is not None: + _dataframe_columnar_count_io_stats = bool(enabled) + return { + 'enabled': _dataframe_columnar_count_io_stats, + 'flush_calls': _dataframe_columnar_flush_calls, + 'flush_s': _dataframe_columnar_flush_ns / 1_000_000_000.0, + 'sync_calls': _dataframe_columnar_sync_calls, + 'sync_s': _dataframe_columnar_sync_ns / 1_000_000_000.0, + 'flush_retry_syncs': _dataframe_columnar_flush_retry_syncs, + } + + +def _debug_dataframe_columnar_plan( + object df, + *, + object table_name=None, + object table_name_col=None, + object symbols='auto', + object at=None): + cdef qdb_pystr_buf* b = qdb_pystr_buf_new() + cdef dataframe_plan_t plan = dataframe_plan_blank() + cdef object failures + try: + _dataframe_plan_build( + b, + df, + table_name, + table_name_col, + symbols, + at, + &plan, + _FIELD_TARGETS_QWP) + failures = _dataframe_columnar_plan_failures(df, &plan) + return { + 'supported': not bool(failures), + 'failures': failures, + 'normalizations': _dataframe_columnar_plan_normalizations( + df, + &plan), + } + finally: + dataframe_plan_release(&plan) + qdb_pystr_buf_free(b) + + +def _bench_dataframe_flush_arrow_batch( + object arrow_source, + *, + object table_name=None, + object at=None, + object conf=None, + size_t iterations=1): + """ + Internal benchmark hook for `column_sender_flush_arrow_batch_server_stamped` + FFI. 
+ + `arrow_source` must expose the Arrow PyCapsule Interface + (`__arrow_c_stream__`) — pa.RecordBatch, pa.Table, pa.RecordBatchReader, + pl.DataFrame, or any other Arrow-native container. Pandas frames are + not accepted here on purpose: this hook benches the Arrow FFI itself, + not pandas→Arrow conversion. Use `_bench_dataframe_plan_and_populate_ + column_chunks` for the pandas chunk-based path. Intentionally kept out + of `__all__`. + """ + cdef size_t iteration + cdef size_t row_count = 0 + cdef size_t col_count = 0 + cdef size_t completed = 0 + cdef questdb_db* db = NULL + cdef column_sender* conn = NULL + cdef line_sender_error* err = NULL + cdef qdb_pystr_buf* b = NULL + cdef PyThreadState* gs = NULL + cdef bytes conf_bytes + cdef bint any_flushed = False + cdef bint flush_attempted = False + cdef size_t deferred_since_sync = 0 + cdef line_sender_table_name c_table_name + cdef line_sender_column_name c_ts_column + cdef line_sender_column_name* c_ts_column_ptr = NULL + cdef ArrowSchema c_schema + cdef bint at_is_column = False + + if iterations == 0: + raise ValueError('iterations must be greater than zero') + if conf is None: + raise ValueError('conf is required for flush_arrow_batch bench.') + if not hasattr(arrow_source, '__arrow_c_stream__'): + raise TypeError( + '_bench_dataframe_flush_arrow_batch requires an Arrow-native ' + 'source exposing __arrow_c_stream__ ' + '(pa.RecordBatch / pa.Table / pl.DataFrame / RecordBatchReader). 
' + f'Got {type(arrow_source).__name__}.') + if not isinstance(table_name, str): + raise TypeError( + 'table_name must be str for Arrow-native DataFrame input.') + if at is None or isinstance(at, ServerTimestampType): + at_is_column = False + elif isinstance(at, str): + at_is_column = True + else: + raise TypeError( + 'at must be a column name str, ServerTimestamp, or None ' + 'for Arrow-native DataFrame input.') + + row_count = int( + getattr(arrow_source, 'num_rows', None) + or getattr(arrow_source, 'height', None) + or 0) + col_count = int( + getattr(arrow_source, 'num_columns', None) + or getattr(arrow_source, 'width', None) + or 0) + + conf_bytes = conf.encode('utf-8') if isinstance(conf, str) else conf + _ensure_doesnt_have_gil(&gs) + db = questdb_db_connect(conf_bytes, len(conf_bytes), &err) + _ensure_has_gil(&gs) + if db == NULL: + raise c_err_to_py(err) + b = qdb_pystr_buf_new() + memset(&c_schema, 0, sizeof(ArrowSchema)) + try: + str_to_table_name(b, table_name, &c_table_name) + if at_is_column: + str_to_column_name(b, at, &c_ts_column) + c_ts_column_ptr = &c_ts_column + + _ensure_doesnt_have_gil(&gs) + conn = questdb_db_borrow_column_sender(db, &err) + _ensure_has_gil(&gs) + if conn == NULL: + raise c_err_to_py(err) + try: + for iteration in range(iterations): + _capsule_consume_stream( + conn, arrow_source, c_table_name, c_ts_column_ptr, + &c_schema, NULL, 0, &any_flushed, &flush_attempted, + &deferred_since_sync) + _dataframe_columnar_sync(conn) + completed = iterations + finally: + questdb_db_return_column_sender(db, conn) + finally: + if c_schema.release != NULL: + c_schema.release(&c_schema) + if b != NULL: + qdb_pystr_buf_free(b) + if db != NULL: + questdb_db_close(db) + + return { + 'iterations': iterations, + 'row_count': row_count, + 'col_count': col_count, + 'logical_cells': row_count * col_count, + 'completed': completed, + } + + +def _bench_dataframe_plan_and_populate_column_chunks( + object df, + *, + object table_name=None, + object 
table_name_col=None, + object symbols='auto', + object at=None, + size_t iterations=1, + size_t max_rows_per_chunk=16384): + """ + Internal benchmark hook for Layer 1 pandas columnar work. + + This builds the shared dataframe plan and populates #148 chunks, but it + never flushes to a sender. It is intentionally kept out of ``__all__``. + """ + cdef size_t iteration + cdef qdb_pystr_buf* b = NULL + cdef dataframe_plan_t plan + cdef column_sender_chunk* chunk = NULL + cdef line_sender_error* err = NULL + cdef uint64_t start_row_path_emissions + cdef uint64_t end_row_path_emissions + cdef size_t row_count = 0 + cdef size_t col_count = 0 + cdef size_t populated_rows = 0 + cdef size_t populated_rows_total = 0 + cdef size_t populated_chunks = 0 + cdef size_t rows_per_chunk = 0 + cdef size_t row_offset + cdef size_t chunk_rows + global _dataframe_count_row_path_emissions + global _dataframe_row_path_emissions + + if iterations == 0: + raise ValueError('iterations must be greater than zero') + + start_row_path_emissions = _dataframe_row_path_emissions + _dataframe_count_row_path_emissions = True + try: + for iteration in range(iterations): + b = qdb_pystr_buf_new() + plan = dataframe_plan_blank() + try: + _dataframe_plan_build( + b, + df, + table_name, + table_name_col, + symbols, + at, + &plan, + _FIELD_TARGETS_QWP) + row_count = plan.row_count + col_count = plan.col_count + if (plan.col_count == 0) or (plan.row_count == 0): + continue + + _dataframe_columnar_validate_plan(df, &plan) + _dataframe_columnar_prebuild_pyobj(df, &plan) + rows_per_chunk = _dataframe_columnar_rows_per_chunk( + &plan, + max_rows_per_chunk) + chunk = column_sender_chunk_new( + plan.c_table_name.buf, + plan.c_table_name.len, + &err) + if chunk == NULL: + raise c_err_to_py(err) + row_offset = 0 + while row_offset < plan.row_count: + if not column_sender_chunk_clear(chunk, &err): + raise c_err_to_py(err) + chunk_rows = rows_per_chunk + if chunk_rows > plan.row_count - row_offset: + chunk_rows = 
plan.row_count - row_offset + _dataframe_columnar_populate_chunk( + &plan, + chunk, + row_offset, + chunk_rows) + populated_rows = column_sender_chunk_row_count(chunk, &err) + if populated_rows == -1: + raise c_err_to_py(err) + if populated_rows != 0: + populated_chunks += 1 + populated_rows_total += populated_rows + row_offset += chunk_rows + finally: + if chunk != NULL: + column_sender_chunk_free(chunk) + chunk = NULL + dataframe_plan_release(&plan) + if b != NULL: + qdb_pystr_buf_free(b) + b = NULL + finally: + _dataframe_count_row_path_emissions = False + + end_row_path_emissions = _dataframe_row_path_emissions + return { + 'iterations': iterations, + 'row_count': row_count, + 'col_count': col_count, + 'logical_cells': row_count * col_count, + 'rows_per_chunk': rows_per_chunk, + 'populated_chunks': populated_chunks, + 'populated_rows_total': populated_rows_total, + 'last_populated_rows': populated_rows, + 'row_path_cell_emissions': ( + end_row_path_emissions - start_row_path_emissions), + } + + +cdef object _POLARS = None +cdef object _POLARS_DATAFRAME_T = None +cdef object _POLARS_LAZYFRAME_T = None + + +cdef bint _try_import_polars(): + global _POLARS, _POLARS_DATAFRAME_T, _POLARS_LAZYFRAME_T + if _POLARS is not None: + return True + try: + import polars + except ImportError: + return False + _POLARS = polars + _POLARS_DATAFRAME_T = polars.DataFrame + _POLARS_LAZYFRAME_T = polars.LazyFrame + return True + + +cdef bint _is_polars_dataframe_or_lazy(object obj): + if not _try_import_polars(): + return False + return isinstance(obj, (_POLARS_DATAFRAME_T, _POLARS_LAZYFRAME_T)) + + +cdef void_int _capsule_consume_stream( + column_sender* conn, + object stream_owner, + line_sender_table_name c_table_name, + line_sender_column_name* c_ts_column_ptr, + ArrowSchema* c_schema, + const column_sender_arrow_override* c_overrides, + size_t c_overrides_len, + bint* any_flushed, + bint* flush_attempted, + size_t* deferred_since_sync) except -1: + # `c_schema` is in/out and 
owned by the caller: zero-init on first + # call (this function populates it via get_schema), reused as-is on + # subsequent calls (Arrow C Data Interface guarantees slices of the + # same source share schema), and released by the caller. + cdef object stream_capsule = stream_owner.__arrow_c_stream__() + if not PyCapsule_IsValid(stream_capsule, b'arrow_array_stream'): + raise TypeError( + '__arrow_c_stream__ did not return a valid arrow_array_stream ' + 'PyCapsule.') + cdef ArrowArrayStream* stream = <ArrowArrayStream*>PyCapsule_GetPointer( + stream_capsule, b'arrow_array_stream') + cdef ArrowArray batch + cdef int rc + cdef const char* stream_err + + if c_schema.release == NULL: + rc = stream.get_schema(stream, c_schema) + if rc != 0: + stream_err = stream.get_last_error(stream) + raise IngressError( + IngressErrorCode.InvalidApiCall, + f'Arrow stream get_schema failed: ' + f'{stream_err.decode("utf-8", errors="replace") if stream_err != NULL else "unknown"}') + + while True: + memset(&batch, 0, sizeof(ArrowArray)) + rc = stream.get_next(stream, &batch) + if rc != 0: + stream_err = stream.get_last_error(stream) + raise IngressError( + IngressErrorCode.InvalidApiCall, + f'Arrow stream get_next failed: ' + f'{stream_err.decode("utf-8", errors="replace") if stream_err != NULL else "unknown"}') + if batch.release == NULL: + break + try: + flush_attempted[0] = True + if deferred_since_sync[0] >= _QWP_MAX_DEFERRED_ARROW_FRAMES: + _dataframe_columnar_sync(conn) + deferred_since_sync[0] = 0 + _dataframe_arrow_flush_batch( + conn, c_table_name, &batch, c_schema, c_ts_column_ptr, + c_overrides, c_overrides_len) + any_flushed[0] = True + deferred_since_sync[0] += 1 + finally: + if batch.release != NULL: + batch.release(&batch) + + +cdef object _validate_schema_overrides(object schema_overrides): + """Convert the public schema_overrides dict into a list of + (name_bytes, kind_int, arg_int) tuples. Returns None if empty. 
+ + Keeping `name_bytes` alive on the Python side lets the C overrides + array borrow the underlying char* without an extra copy. + """ + if not schema_overrides: + return None + if not isinstance(schema_overrides, dict): + raise TypeError( + 'schema_overrides must be a dict mapping column name to ' + "one of: 'symbol', 'ipv4', 'char', or ('geohash', bits).") + cdef list out = [] + cdef object name, override, kind, value + cdef int kind_int + cdef int arg_int + for name, override in schema_overrides.items(): + if not isinstance(name, str): + raise TypeError( + f'schema_overrides key must be str, got ' + f'{type(name).__name__}.') + if isinstance(override, str): + kind = override + value = None + elif isinstance(override, tuple) and len(override) == 2: + kind, value = override + else: + raise TypeError( + f'schema_overrides[{name!r}] has invalid shape ' + f'{override!r}; expected str or (kind, value) tuple.') + arg_int = 0 + if kind == 'symbol': + kind_int = column_sender_arrow_override_symbol + elif kind == 'ipv4': + kind_int = column_sender_arrow_override_ipv4 + elif kind == 'char': + kind_int = column_sender_arrow_override_char + elif kind == 'geohash': + if not isinstance(value, int) or value < 1 or value > 60: + raise ValueError( + f'schema_overrides[{name!r}] geohash bits must ' + f'be int in 1..=60, got {value!r}.') + kind_int = column_sender_arrow_override_geohash + arg_int = value + else: + raise ValueError( + f'schema_overrides[{name!r}] kind {kind!r} not ' + "in {'symbol', 'ipv4', 'char', 'geohash'}.") + out.append((name.encode('utf-8'), kind_int, arg_int)) + return out + + +cdef object _capsule_get_column_names(object sliceable): + """Return list of str column names from polars / pyarrow input, + or None if the input doesn't expose a uniform name list.""" + cdef object names + names = getattr(sliceable, 'column_names', None) + if names is not None: + return list(names) + names = getattr(sliceable, 'columns', None) + if names is not None: + return 
list(names) + return None + + +cdef bint _capsule_polars_dtype_is_string_like(object dtype) except -1: + """polars: Utf8 / String / Categorical / Enum count as string-like.""" + if _POLARS is None: + return False + if dtype == _POLARS.Utf8: + return True + if isinstance(dtype, _POLARS.Categorical): + return True + cdef object enum_t = getattr(_POLARS, 'Enum', None) + if enum_t is not None and isinstance(dtype, enum_t): + return True + return False + + +cdef bint _capsule_pyarrow_type_is_string_like(object field_type) except -1: + """pyarrow: utf8 / large_utf8 / utf8_view, plus Dictionary whose + value type is one of those.""" + if _PYARROW is None: + return False + if (_PYARROW.types.is_string(field_type) + or _PYARROW.types.is_large_string(field_type)): + return True + if _PYARROW.types.is_dictionary(field_type): + value_type = field_type.value_type + if (_PYARROW.types.is_string(value_type) + or _PYARROW.types.is_large_string(value_type)): + return True + return False + + +cdef bint _capsule_pandas_dtype_is_string_like(object dtype) except -1: + cdef object storage + cdef object arrow_type + cdef object cat_dtype + if _PANDAS is None: + return False + if isinstance(dtype, _PANDAS.StringDtype): + storage = getattr(dtype, 'storage', None) + return storage == 'pyarrow' + if isinstance(dtype, _PANDAS.ArrowDtype): + _dataframe_require_pyarrow() + arrow_type = dtype.pyarrow_dtype + return _capsule_pyarrow_type_is_string_like(arrow_type) + if isinstance(dtype, _PANDAS.CategoricalDtype): + cat_dtype = dtype.categories.dtype + if cat_dtype == object: + return True + return _capsule_pandas_dtype_is_string_like(cat_dtype) + return False + + +cdef object _capsule_get_string_column_names(object sliceable): + """Return names of all string-like columns (utf8 / large_utf8 / + utf8_view / dict-of-utf8). Supports polars DataFrame and pyarrow + Table / RecordBatch. 
Returns None if schema introspection is not + available on the input.""" + cdef object schema + cdef object out + cdef object name + cdef object dtype + cdef object field_type + cdef int i + if _is_pandas_dataframe_object(sliceable): + _dataframe_may_import_deps() + out = [] + for name, dtype in sliceable.dtypes.items(): + if _capsule_pandas_dtype_is_string_like(dtype): + out.append(name) + return out + if _POLARS is not None and isinstance(sliceable, _POLARS_DATAFRAME_T): + out = [] + for name, dtype in sliceable.schema.items(): + if _capsule_polars_dtype_is_string_like(dtype): + out.append(name) + return out + if _PYARROW is None: + try: + _dataframe_require_pyarrow() + except ImportError: + return None + if isinstance(sliceable, (_PYARROW.Table, _PYARROW.RecordBatch)): + schema = sliceable.schema + out = [] + for i in range(len(schema.names)): + field_type = schema.field(i).type + if _capsule_pyarrow_type_is_string_like(field_type): + out.append(schema.names[i]) + return out + return None + + +cdef object _capsule_column_is_string_like(object sliceable, str name): + """Returns True iff `name` is a string-like column on `sliceable`, + False iff it is some other type, or None if schema introspection + is not available on the input.""" + cdef object dtype + cdef object field_type + if _is_pandas_dataframe_object(sliceable): + _dataframe_may_import_deps() + try: + dtype = sliceable.dtypes[name] + except KeyError: + raise KeyError( + f'symbols column {name!r} not found in the dataframe.') + return _capsule_pandas_dtype_is_string_like(dtype) + if _POLARS is not None and isinstance(sliceable, _POLARS_DATAFRAME_T): + try: + dtype = sliceable.schema[name] + except KeyError: + raise KeyError( + f'symbols column {name!r} not found in the dataframe.') + return _capsule_polars_dtype_is_string_like(dtype) + if _PYARROW is None: + try: + _dataframe_require_pyarrow() + except ImportError: + return None + if isinstance(sliceable, (_PYARROW.Table, _PYARROW.RecordBatch)): + try: + 
field_type = sliceable.schema.field(name).type + except (KeyError, ValueError): + raise KeyError( + f'symbols column {name!r} not found in the dataframe.') + return _capsule_pyarrow_type_is_string_like(field_type) + return None + + +cdef object _capsule_get_dict_string_column_names(object sliceable): + """Return names of dict-encoded string-like columns (polars + Categorical / Enum or pyarrow Dictionary(*, utf8/large_utf8)). + Returns None if schema introspection is not available.""" + cdef object schema + cdef object out + cdef object name + cdef object dtype + cdef object field_type + cdef object value_type + cdef object enum_t + cdef int i + if _is_pandas_dataframe_object(sliceable): + _dataframe_may_import_deps() + out = [] + for name, dtype in sliceable.dtypes.items(): + if (isinstance(dtype, _PANDAS.CategoricalDtype) + and _capsule_pandas_dtype_is_string_like(dtype)): + out.append(name) + return out + if _POLARS is not None and isinstance(sliceable, _POLARS_DATAFRAME_T): + out = [] + enum_t = getattr(_POLARS, 'Enum', None) + for name, dtype in sliceable.schema.items(): + if isinstance(dtype, _POLARS.Categorical): + out.append(name) + elif enum_t is not None and isinstance(dtype, enum_t): + out.append(name) + return out + if _PYARROW is None: + try: + _dataframe_require_pyarrow() + except ImportError: + return None + if isinstance(sliceable, (_PYARROW.Table, _PYARROW.RecordBatch)): + schema = sliceable.schema + out = [] + for i in range(len(schema.names)): + field_type = schema.field(i).type + if _PYARROW.types.is_dictionary(field_type): + value_type = field_type.value_type + if (_PYARROW.types.is_string(value_type) + or _PYARROW.types.is_large_string(value_type)): + out.append(schema.names[i]) + return out + return None + + +cdef object _resolve_symbols_to_overrides(object sliceable, object symbols): + """Translate `symbols` into a list of + (name_bytes, column_sender_arrow_override_symbol, arg) tuples + matching the shape returned by 
_validate_schema_overrides. Returns: + + - [] for None / 'auto' (no overrides, Rust default applies — + Dictionary columns auto-classify as SymbolDict). + - list for True (auto-detect str cols) / False (force NotSymbol + on every dict-encoded str col) / List[str] / List[int]. + - None if resolution requires introspection not available on the + input; caller falls back to Manual plan. + + arg=0 in the tuple means "mark as SYMBOL"; arg=1 means "force + NOT-SYMBOL" (Rust decodes dict to VARCHAR on emit). See + column_sender.h `column_sender_arrow_override::arg`. + + Raises IngressError(BadDataFrame) when an explicitly-named symbols + entry targets a non-string column (matches Manual plan semantics). + """ + cdef list out + cdef int kind_int = column_sender_arrow_override_symbol + cdef object col_names + cdef object entry + cdef object name + cdef object is_str + cdef int idx + cdef set listed + cdef object dict_names + + if symbols is None or symbols == 'auto': + return [] + + if symbols is False: + col_names = _capsule_get_dict_string_column_names(sliceable) + if col_names is None: + return None + out = [] + for entry in col_names: + out.append((entry.encode('utf-8'), kind_int, 1)) + return out + + if symbols is True: + col_names = _capsule_get_string_column_names(sliceable) + if col_names is None: + return None + out = [] + for entry in col_names: + out.append((entry.encode('utf-8'), kind_int, 0)) + return out + + if not isinstance(symbols, (list, tuple)): + return None + + out = [] + col_names = None + listed = set() + for entry in symbols: + if isinstance(entry, str): + name = entry + elif isinstance(entry, int): + if col_names is None: + col_names = _capsule_get_column_names(sliceable) + if col_names is None: + return None + idx = entry + if idx < 0 or idx >= len(col_names): + raise ValueError( + f'symbols index {idx} out of range ' + f'(have {len(col_names)} columns).') + name = col_names[idx] + else: + raise TypeError( + f'symbols entry must be str or int, got 
' + f'{type(entry).__name__}.') + is_str = _capsule_column_is_string_like(sliceable, name) + if is_str is None: + return None + if not is_str: + raise IngressError( + IngressErrorCode.BadDataFrame, + f'Bad argument `symbols`: column {name!r} is not a ' + f'strings column.') + listed.add(name) + out.append((name.encode('utf-8'), kind_int, 0)) + + # Match the row/numpy planner: an explicit symbols list marks only the + # listed columns as symbols; every other dict-encoded (categorical) + # column falls through to a plain VARCHAR field (arg=1, force + # NOT-SYMBOL) rather than being auto-symbolized. + dict_names = _capsule_get_dict_string_column_names(sliceable) + if dict_names is None: + return None + for name in dict_names: + if name not in listed: + out.append((name.encode('utf-8'), kind_int, 1)) + return out + + +cdef object _merge_capsule_overrides( + object symbol_overrides, object validated_overrides): + """Merge symbol overrides into validated schema_overrides. + schema_overrides take precedence on name collision.""" + cdef set explicit_names + cdef list merged + cdef object entry + if not symbol_overrides and validated_overrides is None: + return None + if not symbol_overrides: + return validated_overrides + if validated_overrides is None: + return symbol_overrides + explicit_names = {entry[0] for entry in validated_overrides} + merged = list(validated_overrides) + for entry in symbol_overrides: + if entry[0] not in explicit_names: + merged.append(entry) + return merged + + +cdef bint _is_pandas_dataframe_object(object obj): + cdef object cls + cdef object module + cdef object name + if _PANDAS is not None and isinstance(obj, _PANDAS.DataFrame): + return True + try: + for cls in type(obj).__mro__: + module = getattr(cls, '__module__', '') + name = getattr(cls, '__name__', '') + if (name == 'DataFrame' and + isinstance(module, str) and + (module == 'pandas' or module.startswith('pandas.'))): + return True + except Exception: + return False + return False + + 
+cdef object _MASKED_DTYPE = None +cdef bint _MASKED_DTYPE_READY = False + + +cdef object _pandas_masked_dtype(): + global _MASKED_DTYPE, _MASKED_DTYPE_READY + if not _MASKED_DTYPE_READY: + try: + from pandas.core.arrays.masked import BaseMaskedDtype + _MASKED_DTYPE = BaseMaskedDtype + except Exception: + _MASKED_DTYPE = () + _MASKED_DTYPE_READY = True + return _MASKED_DTYPE + + +cdef bint _pandas_dataframe_requires_manual_planner(object df) except -1: + # A fully Arrow-backed frame takes the zero-copy capsule path; any + # numpy / object / masked / categorical column routes the whole frame to + # the manual planner (which ingests those directly and the Arrow-backed + # columns via the arrow-import path). + cdef object dtype + cdef object arrow_dtype + if not _is_pandas_dataframe_object(df): + return False + _dataframe_may_import_deps() + arrow_dtype = getattr(_PANDAS, 'ArrowDtype', None) + try: + for dtype in df.dtypes: + if arrow_dtype is not None and isinstance(dtype, arrow_dtype): + continue + if isinstance(dtype, _PANDAS.StringDtype): + if getattr(dtype, 'storage', None) == 'pyarrow': + continue + return True + except Exception: + return True + return False + + +cdef bint _pandas_dataframe_is_timestamp_only_at( + object df, + object at) except -1: + if not _is_pandas_dataframe_object(df) or not isinstance(at, str): + return False + try: + return len(df.columns) == 1 and df.columns[0] == at + except Exception: + return False + + +cdef Py_ssize_t _capsule_row_count(object sliceable) except -2: + cdef object row_count_obj = getattr(sliceable, 'num_rows', None) + if row_count_obj is None: + row_count_obj = getattr(sliceable, 'height', None) + if row_count_obj is not None: + return row_count_obj + if _is_pandas_dataframe_object(sliceable): + return len(sliceable) + return -1 + + +cdef object _capsule_slice_rows( + object sliceable, + Py_ssize_t offset, + Py_ssize_t row_count): + if hasattr(sliceable, 'slice'): + return sliceable.slice(offset, row_count) + if 
_is_pandas_dataframe_object(sliceable): + return sliceable.iloc[offset:offset + row_count] + return None + + +cdef bint _dataframe_client_try_capsule_path( + questdb_db* db, + uint64_t budget_ms, + object df, + object table_name, + object table_name_col, + object symbols, + object at, + size_t max_rows_per_batch, + object schema_overrides) except -1: + cdef qdb_pystr_buf* b = NULL + cdef column_sender* conn = NULL + cdef line_sender_error* err = NULL + cdef PyThreadState* gs = NULL + cdef object sliceable = None + cdef bint any_flushed = False + cdef bint flush_attempted = False + cdef size_t deferred_since_sync = 0 + cdef bint sync_attempted = False + cdef bint force_drop_conn = False + cdef object row_slice = None + cdef Py_ssize_t total_rows = 0 + cdef Py_ssize_t offset = 0 + cdef Py_ssize_t chunk_rows + cdef object validated_overrides + cdef object symbol_overrides + cdef object merged_overrides + cdef bint can_slice = False + cdef line_sender_table_name c_table_name + cdef line_sender_column_name c_ts_column + cdef line_sender_column_name* c_ts_column_ptr = NULL + cdef ArrowSchema c_schema + cdef column_sender_arrow_override* c_overrides = NULL + cdef size_t c_overrides_len = 0 + cdef bint at_is_column = False + cdef size_t i + cdef object name_bytes + cdef int kind_int + cdef int arg_int + + if _pandas_dataframe_requires_manual_planner(df): + return False + if _pandas_dataframe_is_timestamp_only_at(df, at): + return False + if table_name_col is not None: + return False + + validated_overrides = _validate_schema_overrides(schema_overrides) + + # LazyFrame: prefer the streaming engine (polars 1.0+) for lower + # peak memory. `LazyFrame.collect_batches()` would stream natively + # but upstream marks it unstable and "much slower than native sinks", + # so we materialize and slice downstream. 
+ if _is_polars_dataframe_or_lazy(df) and isinstance( + df, _POLARS_LAZYFRAME_T): + try: + sliceable = df.collect(engine='streaming') + except TypeError: + sliceable = df.collect() + elif hasattr(df, '__arrow_c_stream__'): + sliceable = df + elif hasattr(df, '__arrow_c_array__'): + _dataframe_require_pyarrow() + sliceable = _PYARROW.Table.from_batches( + [_PYARROW.record_batch(df)]) + else: + return False + + total_rows = _capsule_row_count(sliceable) + + if not isinstance(table_name, str): + raise TypeError( + 'table_name must be str for Arrow-native DataFrame input.') + if at is None or isinstance(at, ServerTimestampType): + at_is_column = False + elif isinstance(at, str): + at_is_column = True + else: + raise TypeError( + 'at must be a column name str, ServerTimestamp, or None ' + 'for Arrow-native DataFrame input.') + + # An empty frame is a no-op: emit nothing and skip symbol-shape + # validation, which is moot with zero rows. + if total_rows == 0: + return True + + symbol_overrides = _resolve_symbols_to_overrides(sliceable, symbols) + if symbol_overrides is None: + return False + merged_overrides = _merge_capsule_overrides( + symbol_overrides, validated_overrides) + + can_slice = (total_rows >= 0) and ( + hasattr(sliceable, 'slice') + or _is_pandas_dataframe_object(sliceable)) + + b = qdb_pystr_buf_new() + memset(&c_schema, 0, sizeof(ArrowSchema)) + try: + str_to_table_name(b, table_name, &c_table_name) + if at_is_column: + str_to_column_name(b, at, &c_ts_column) + c_ts_column_ptr = &c_ts_column + + if merged_overrides is not None: + c_overrides_len = len(merged_overrides) + c_overrides = <column_sender_arrow_override*>calloc( + c_overrides_len, sizeof(column_sender_arrow_override)) + if c_overrides == NULL: + raise MemoryError() + for i in range(c_overrides_len): + name_bytes, kind_int, arg_int = merged_overrides[i] + c_overrides[i].column = PyBytes_AsString(name_bytes) + c_overrides[i].column_len = PyBytes_GET_SIZE(name_bytes) + c_overrides[i].kind = kind_int + c_overrides[i].arg = 
arg_int + + _ensure_doesnt_have_gil(&gs) + if budget_ms == 0: + conn = questdb_db_borrow_column_sender(db, &err) + else: + conn = questdb_db_borrow_column_sender_with_retry(db, budget_ms, &err) + _ensure_has_gil(&gs) + if conn == NULL: + raise c_err_to_py(err) + + try: + if not can_slice: + _capsule_consume_stream_with_hint( + conn, sliceable, c_table_name, c_ts_column_ptr, + &c_schema, c_overrides, c_overrides_len, + &any_flushed, &flush_attempted, &deferred_since_sync, + max_rows_per_batch, False) + else: + offset = 0 + while offset < total_rows: + chunk_rows = max_rows_per_batch + if chunk_rows > total_rows - offset: + chunk_rows = total_rows - offset + row_slice = _capsule_slice_rows( + sliceable, offset, chunk_rows) + _capsule_consume_stream_with_hint( + conn, row_slice, c_table_name, c_ts_column_ptr, + &c_schema, c_overrides, c_overrides_len, + &any_flushed, &flush_attempted, &deferred_since_sync, + max_rows_per_batch, True) + offset += chunk_rows + sync_attempted = True + _dataframe_columnar_sync(conn) + except: + force_drop_conn = _dataframe_columnar_force_drop_after_error( + conn, any_flushed, flush_attempted, sync_attempted) + raise + + return True + finally: + _ensure_has_gil(&gs) + if conn != NULL: + if force_drop_conn: + questdb_db_drop_column_sender(db, conn) + else: + questdb_db_return_column_sender(db, conn) + if c_schema.release != NULL: + c_schema.release(&c_schema) + if c_overrides != NULL: + free(c_overrides) + if b != NULL: + qdb_pystr_buf_free(b) + + +cdef void_int _capsule_consume_stream_with_hint( + column_sender* conn, + object stream_owner, + line_sender_table_name c_table_name, + line_sender_column_name* c_ts_column_ptr, + ArrowSchema* c_schema, + const column_sender_arrow_override* c_overrides, + size_t c_overrides_len, + bint* any_flushed, + bint* flush_attempted, + size_t* deferred_since_sync, + size_t max_rows_per_batch, + bint can_slice) except -1: + cdef str hint + try: + _capsule_consume_stream( + conn, stream_owner, c_table_name, 
c_ts_column_ptr, c_schema, + c_overrides, c_overrides_len, any_flushed, flush_attempted, + deferred_since_sync) + except IngressError as exc: + if _is_batch_too_large_error(exc): + if can_slice: + hint = ( + f'reduce `max_rows_per_batch` (current: ' + f'{max_rows_per_batch}) and retry.') + else: + hint = ( + f'this is a streaming Arrow source (e.g. ' + f'pa.RecordBatchReader); batch size is set by the ' + f'producer and `max_rows_per_batch` (current: ' + f'{max_rows_per_batch}) does not bound it. ' + f'Materialise to a `pa.Table` ' + f'(`pa.Table.from_batches(reader)`) or re-batch ' + f'at the source before passing.') + raise IngressError( + exc.code, f'{exc}\nHint: {hint}') from exc + raise + + +cdef bint _is_batch_too_large_error(object exc): + cdef str msg + if not isinstance(exc, IngressError): + return False + msg = str(exc).lower() + return ( + ('row_count' in msg and ('exceeds' in msg or 'too large' in msg)) + or 'batch too large' in msg + or ('value_data' in msg and 'exceeds' in msg)) + + +cdef class Client: + """ + Pooled QWP/WebSocket client. + + This is the ownership surface for the #148 `questdb_db` pool. DataFrame + ingestion will borrow `column_sender` handles from this pool. + """ + cdef questdb_db* _db + cdef object _conf_str + cdef object _state_cond + cdef size_t _active_uses + + def __cinit__(self): + self._db = NULL + self._conf_str = None + self._state_cond = threading.Condition(threading.RLock()) + self._active_uses = 0 + + cdef questdb_db* _begin_db_use(self, str method) except? 
NULL: + cdef questdb_db* db = NULL + self._state_cond.acquire() + try: + db = self._db + if db == NULL: + raise IngressError( + IngressErrorCode.InvalidApiCall, + f"{method}() can't be called: Client is closed.") + self._active_uses += 1 + return db + finally: + self._state_cond.release() + + cdef void _end_db_use(self) except *: + self._state_cond.acquire() + try: + if self._active_uses == 0: + raise RuntimeError('Client use counter underflow.') + self._active_uses -= 1 + if self._active_uses == 0: + self._state_cond.notify_all() + finally: + self._state_cond.release() + + @staticmethod + def from_conf(str conf_str): + """ + Construct a pooled client from a QWP/WebSocket configuration string. + + The underlying #148 pool is opened eagerly by `questdb_db_connect`. + Include ``sf_dir=...`` to opt the columnar dataframe path into + store-and-forward mode; without ``sf_dir`` dataframe ingestion uses the + direct QWP/WebSocket column sender. + """ + cdef line_sender_error* err = NULL + cdef line_sender_utf8 c_conf + cdef object protocol + cdef dict params + cdef qdb_pystr_buf* b = qdb_pystr_buf_new() + cdef Client client = Client.__new__(Client) + cdef PyThreadState* gs = NULL + try: + protocol, params = parse_conf_str(b, conf_str) + if protocol not in (Protocol.QwpWs, Protocol.QwpWss): + raise IngressError( + IngressErrorCode.ConfigError, + 'Client.from_conf() requires a QWP/WebSocket ' + 'configuration string: qwpws:: or qwpwss::.') + if params.get('addr') is None: + raise IngressError( + IngressErrorCode.ConfigError, + 'Missing "addr" parameter in config string') + + str_to_utf8(b, conf_str, &c_conf) + _ensure_doesnt_have_gil(&gs) + client._db = questdb_db_connect(c_conf.buf, c_conf.len, &err) + _ensure_has_gil(&gs) + if client._db == NULL: + raise c_err_to_py(err) + client._conf_str = conf_str + return client + finally: + _ensure_has_gil(&gs) + qdb_pystr_buf_free(b) + + def __enter__(self): + self._state_cond.acquire() + try: + if self._db == NULL: + raise 
IngressError( + IngressErrorCode.InvalidApiCall, + '__enter__() can\'t be called: Client is closed.') + finally: + self._state_cond.release() + return self + + def dataframe( + self, + df, + *, + table_name: Optional[str] = None, + table_name_col: Union[None, int, str] = None, + symbols: Union[str, bool, List[int], List[str]] = 'auto', + at: Union[int, str], + max_rows_per_batch: int = 16384, + schema_overrides: Optional[Dict[str, object]] = None): + """ + Ingest a dataframe through the pooled columnar QWP path. + + When this client was opened with ``sf_dir=...``, + :meth:`Client.dataframe` uses the store-and-forward column sender. Each + batch is accepted into the local SFA queue first, and this method still + waits for ``AckLevel::Ok`` before returning; low-level columnar + ``flush`` calls have the weaker local-acceptance contract. + + ``df`` accepts any of: + + - **pandas** ``pandas.DataFrame``. NumPy-backed columns route + through the legacy planner; pyarrow-backed columns route + through the Arrow C Stream capsule path below. + - **polars** ``polars.DataFrame`` and ``polars.LazyFrame``. + ``LazyFrame`` is materialised via + ``.collect(engine='streaming')`` (eager ``.collect()`` on + polars < 1.0). + - **pyarrow** ``pa.Table``, ``pa.RecordBatch``, and + ``pa.RecordBatchReader``. + - Any object exposing the Arrow C Data Interface — i.e. with + ``__arrow_c_stream__`` (duckdb / cudf / modin / pyarrow-backed + pandas 2.2+) or ``__arrow_c_array__`` (single Arrow array + exporters, wrapped into a one-batch ``pa.Table``). + + Supports a column-QWP v1 subset: fixed ``table_name``, non-null + designated timestamp column, and the following per-column dtypes: + + - **Numeric**: NumPy ``bool/int{8,16,32,64}/uint{8..64}/float{32,64}``. + Arrow ``pa.int{8,16,32,64}``, ``pa.float{16,32,64}``, and + ``pa.uint{8,16,32,64}`` are accepted by the Rust Arrow batch route + when the frame uses a fixed table name and a designated timestamp + column name. 
Unsigned Arrow values follow the + Rust Arrow policy: ``UInt8`` widens to ``SHORT``, ``UInt16`` to + ``INT``, ``UInt32`` to ``LONG``, and ``UInt64`` values up to + ``i64::MAX`` are accepted as ``LONG``. Larger ``UInt64`` values are + rejected because QuestDB QWP-WS encodes integers as signed ``i64``. + - **String / Symbol**: object-dtype ``str``, ``pa.string()``, + ``pa.large_string()``, ``pd.CategoricalDtype`` of strings. + - **Timestamp**: NumPy ``datetime64`` units accepted by pandas and + ``pa.timestamp`` with unit ``s``, ``ms``, ``us``, or ``ns`` + (tz-aware accepted on Arrow-backed columns in the Rust Arrow route). + QuestDB ``TIMESTAMP`` columns cannot contain nulls/NaT or values + before the Unix epoch. + - **Decimal**: Arrow-backed ``pa.decimal{32,64,128,256}`` columns + (``pa.decimal32``/``pa.decimal64`` require pyarrow >= 18). Plain + object-dtype columns of ``decimal.Decimal`` are not accepted on the + columnar path; back them with an Arrow decimal type instead. + - **UUID**: ``pa.fixed_size_binary(16)`` and the ``arrow.uuid`` + extension type. Bytes are forwarded verbatim as **QuestDB's + UUID wire layout** ("bytes 0..8 lo half LE, bytes 8..16 hi + half LE"), matching the convention shared across the + c-questdb-client family (Rust direct, Polars). Round-trip is + byte-identity at this layout; users who want + ``uuid.UUID.bytes`` (RFC 4122 big-endian) round-trip must + convert at their boundary. + + Server-side coercion handles cross-type writes (e.g. ``pa.string()`` + UUIDs landing in a UUID column are parsed server-side; narrow ints + landing in a wider column are widened). Failures surface as + ``IngressError`` from the ``flush()``. 
+ """ + cdef qdb_pystr_buf* b = qdb_pystr_buf_new() + cdef dataframe_plan_t plan = dataframe_plan_blank() + cdef questdb_db* db = NULL + cdef bint db_use = False + cdef uint64_t budget_ms = 0 + cdef double deadline = 0.0 + cdef double remaining = 0.0 + db = self._begin_db_use('dataframe') + db_use = True + try: + if max_rows_per_batch <= 0: + raise ValueError('max_rows_per_batch must be >= 1.') + if not isinstance(at, str) and not ( + isinstance(at, int) and not isinstance(at, bool)): + raise UnsupportedDataFrameShapeError( + 'Client.dataframe requires `at` to name the designated ' + 'timestamp column (by name or index); scalar timestamps ' + 'are not supported on the columnar path.') + # Overall failover deadline, matching the row sender's + # `reconnect_max_duration` budget. + deadline = time.monotonic() + \ + questdb_db_reconnect_max_duration_ms(db) / 1000.0 + while True: + # Reclaim string storage from a prior attempt's released plan. + qdb_pystr_buf_clear(b) + try: + if _dataframe_client_try_capsule_path( + db, + budget_ms, + df, + table_name, + table_name_col, + symbols, + at, + max_rows_per_batch, + schema_overrides): + return self + return self._dataframe_numpy_publish( + db, budget_ms, b, &plan, df, table_name, + table_name_col, symbols, at, max_rows_per_batch) + except IngressError as exc: + # FailoverRetry = transient flush/sync; SocketError = a + # re-borrow that has not reached a live primary yet. + if exc.code not in ( + IngressErrorCode.FailoverRetry, + IngressErrorCode.SocketError): + raise + remaining = deadline - time.monotonic() + if remaining <= 0.0: + raise + # The next attempt re-borrows with the row API's reconnect + # backoff (`borrow_conn_with_retry`), bounded by the + # remaining budget; no extra client-side sleep. 
+ budget_ms = <uint64_t>(remaining * 1000.0) + finally: + qdb_pystr_buf_free(b) + if db_use: + self._end_db_use() + + cdef object _dataframe_numpy_publish( + self, + questdb_db* db, + uint64_t budget_ms, + qdb_pystr_buf* b, + dataframe_plan_t* plan, + object df, + object table_name, + object table_name_col, + object symbols, + object at, + size_t max_rows_per_batch): + cdef column_sender_chunk* chunk = NULL + cdef column_sender* conn = NULL + cdef line_sender_error* err = NULL + cdef PyThreadState* gs = NULL + cdef bint flushed = False + cdef bint sync_attempted = False + cdef bint force_drop_conn = False + cdef bint flush_attempted = False + cdef size_t rows_per_chunk + cdef size_t row_offset + cdef size_t chunk_rows + try: + df = _dataframe_normalize_nullable(df) + df = _dataframe_normalize_at_timestamp(df, at) + _dataframe_plan_build( + b, + df, + table_name, + table_name_col, + symbols, + at, + plan, + _FIELD_TARGETS_QWP) + if (plan.col_count == 0) or (plan.row_count == 0): + return self + + _dataframe_apply_roundtrip_overrides(df, plan) + _dataframe_columnar_validate_plan(df, plan) + _dataframe_columnar_prebuild_pyobj(df, plan) + rows_per_chunk = _dataframe_columnar_rows_per_chunk( + plan, max_rows_per_batch) + + _ensure_doesnt_have_gil(&gs) + if budget_ms == 0: + conn = questdb_db_borrow_column_sender(db, &err) + else: + conn = questdb_db_borrow_column_sender_with_retry(db, budget_ms, &err) + _ensure_has_gil(&gs) + if conn == NULL: + raise c_err_to_py(err) + + chunk = column_sender_chunk_new( + plan.c_table_name.buf, + plan.c_table_name.len, + &err) + if chunk == NULL: + raise c_err_to_py(err) + try: + row_offset = 0 + while row_offset < plan.row_count: + if not column_sender_chunk_clear(chunk, &err): + raise c_err_to_py(err) + chunk_rows = rows_per_chunk + if chunk_rows > plan.row_count - row_offset: + chunk_rows = plan.row_count - row_offset + _dataframe_columnar_populate_chunk( + plan, + chunk, + row_offset, + chunk_rows) + flush_attempted = True + 
_dataframe_columnar_flush( + conn, + chunk, + row_offset != 0) + flushed = True + row_offset += chunk_rows + + sync_attempted = True + _dataframe_columnar_sync(conn) + except: + force_drop_conn = _dataframe_columnar_force_drop_after_error( + conn, flushed, flush_attempted, sync_attempted) + raise + + return self + finally: + _ensure_has_gil(&gs) + if conn != NULL: + if force_drop_conn: + questdb_db_drop_column_sender(db, conn) + else: + questdb_db_return_column_sender(db, conn) + if chunk != NULL: + column_sender_chunk_free(chunk) + # The plan is rebuilt on each failover attempt; release this + # attempt's plan so a re-send starts from a blank plan. + dataframe_plan_release(plan) + plan[0] = dataframe_plan_blank() + + def query(self, str sql) -> QueryResult: + """ + Execute a SQL query and return a :class:`QueryResult`. + + Egress goes through the QuestDB Wire Protocol (QWP/WebSocket) + ``/read/v1`` endpoint. The reader is borrowed from the same + connection pool that hosts the ingress writers and is returned to + the pool when the returned :class:`QueryResult` is consumed or + closed (a poisoned connection is dropped instead). Auth / TLS + settings apply to both directions. + + :param sql: SQL text to execute. Forwarded verbatim to QuestDB. + + :return: A :class:`QueryResult`. Materialise it via + ``to_pandas()``, ``to_arrow()``, ``iter_arrow()``, + ``iter_pandas()``, or the ``__arrow_c_stream__`` PyCapsule + protocol. + + Sentinel-value collisions in the result frame round-trip QuestDB's + contract: ``INT64_MIN`` in a LONG column, NaN in DOUBLE / FLOAT, + and the sentinel values for INT / DATE / TIMESTAMP / + TIMESTAMP_NS / CHAR / UUID / LONG256 / IPV4 / GEOHASH are all + interpreted as NULL by QuestDB and cannot be distinguished from + legitimate occurrences of those values. + """ + # Borrow a reader from the same `questdb_db` pool that hosts + # the ingress writers. 
The pool amortises TCP+TLS handshake + # cost across many `Client.query()` calls: the first call + # opens a connection, subsequent calls hit the idle-list + # cache. See `c-questdb-client/questdb-rs/src/ingress/ + # column_sender/db.rs` for the pool's structure. + cdef _ReaderHandle reader_handle + cdef _CursorHandle cursor_handle + cdef questdb_db* db + db = self._begin_db_use('query') + try: + reader_handle = _borrow_reader_from_pool(db) + cursor_handle = _execute_query(reader_handle, sql) + finally: + self._end_db_use() + return QueryResult(cursor_handle) + + def reap_idle(self): + """ + Manually reap idle above-pool-size connections. + """ + cdef size_t closed + cdef PyThreadState* gs = NULL + cdef questdb_db* db = NULL + cdef bint db_use = False + db = self._begin_db_use('reap_idle') + db_use = True + try: + _ensure_doesnt_have_gil(&gs) + closed = questdb_db_reap_idle(db) + _ensure_has_gil(&gs) + return closed + finally: + _ensure_has_gil(&gs) + if db_use: + self._end_db_use() + + cpdef close(self): + """ + Close the client and its connection pool. + + This method is idempotent. + """ + cdef questdb_db* db = NULL + cdef PyThreadState* gs = NULL + self._state_cond.acquire() + try: + db = self._db + if db == NULL: + return + self._db = NULL + self._conf_str = None + while self._active_uses != 0: + self._state_cond.wait() + finally: + self._state_cond.release() + _ensure_doesnt_have_gil(&gs) + # `questdb_db_close` drains both the writer and reader free + # lists in one shot (see `db.rs::DbInner::Drop`). + questdb_db_close(db) + _ensure_has_gil(&gs) + + def __exit__(self, exc_type, _exc_val, _exc_tb): + self.close() + + def __dealloc__(self): + cdef questdb_db* db + cdef PyThreadState* gs = NULL + if self._db != NULL: + db = self._db + self._db = NULL + _ensure_doesnt_have_gil(&gs) + questdb_db_close(db) + _ensure_has_gil(&gs) + + +cdef class Sender: + """ + Ingest data into QuestDB. + + See the :ref:`sender` documentation for more information. 
+ """ + + # Needed so that the Buffer held by a Sender can hold a weakref to its Sender. + # This avoids a circular reference that requires the GC to clean up. + cdef object __weakref__ + + cdef line_sender_protocol _c_protocol + cdef line_sender_opts* _opts + cdef line_sender* _impl + cdef Buffer _buffer + cdef object _qwp_ws_error_handler + cdef auto_flush_mode_t _auto_flush_mode + cdef int64_t* _last_flush_ms + cdef size_t _init_buf_size + cdef bint _in_txn + cdef int64_t _slot_id + + cdef void_int _set_sender_fields( + self, + qdb_pystr_buf* b, + object protocol, + str bind_interface, + str username, + str password, + str token, + str token_x, + str token_y, + object auth_timeout, + object tls_verify, + object tls_ca, + object tls_roots, + object tls_roots_password, + object max_buf_size, + object retry_timeout, + object retry_max_backoff, + object request_min_throughput, + object request_timeout, + object auto_flush, + object auto_flush_rows, + object auto_flush_bytes, + object auto_flush_interval, + object max_datagram_size, + object multicast_ttl, + object protocol_version, + object qwp_ws_progress, + object qwp_ws_error_handler, + object init_buf_size, + object max_name_len) except -1: + """ + Set optional parameters for the sender. 
+ """ + cdef line_sender_error* err = NULL + cdef str user_agent = 'questdb/python/' + VERSION + cdef line_sender_utf8 c_user_agent + cdef line_sender_utf8 c_bind_interface + cdef line_sender_utf8 c_username + cdef line_sender_utf8 c_password + cdef line_sender_utf8 c_token + cdef line_sender_utf8 c_token_x + cdef line_sender_utf8 c_token_y + cdef uint64_t c_auth_timeout + cdef bint c_tls_verify + cdef line_sender_ca c_tls_ca + cdef line_sender_utf8 c_tls_roots + cdef line_sender_utf8 c_tls_roots_password + cdef uint64_t c_max_buf_size + cdef uint64_t c_retry_timeout + cdef uint64_t c_retry_max_backoff + cdef uint64_t c_request_min_throughput + cdef uint64_t c_request_timeout + cdef size_t c_max_datagram_size = 0 + cdef uint32_t c_multicast_ttl = 0 + cdef line_sender_qwpws_progress c_qwp_ws_progress + + self._c_protocol = protocol.c_value + + # It's OK to override this setting. + str_to_utf8(b, user_agent, &c_user_agent) + if not line_sender_opts_user_agent(self._opts, c_user_agent, &err): + raise c_err_to_py(err) + + if bind_interface is not None: + str_to_utf8(b, bind_interface, &c_bind_interface) + if not line_sender_opts_bind_interface( + self._opts, c_bind_interface, &err): + raise c_err_to_py(err) + + if max_datagram_size is not None: + if not isinstance(max_datagram_size, int) or isinstance(max_datagram_size, bool): + raise TypeError( + '"max_datagram_size" must be a positive int, ' + f'not {_fqn(type(max_datagram_size))}') + if max_datagram_size <= 0 or max_datagram_size > 65507: + raise ValueError( + '"max_datagram_size" must be an int between 1 and 65507, ' + f'not {max_datagram_size!r}') + c_max_datagram_size = max_datagram_size + if not line_sender_opts_max_datagram_size( + self._opts, c_max_datagram_size, &err): + raise c_err_to_py(err) + + if multicast_ttl is not None: + if not isinstance(multicast_ttl, int) or isinstance(multicast_ttl, bool): + raise TypeError( + '"multicast_ttl" must be an int (0-255), ' + f'not {_fqn(type(multicast_ttl))}') + if 
multicast_ttl < 0 or multicast_ttl > 255: + raise ValueError( + '"multicast_ttl" must be an int (0-255), ' + f'not {multicast_ttl!r}') + c_multicast_ttl = multicast_ttl + if not line_sender_opts_multicast_ttl( + self._opts, c_multicast_ttl, &err): + raise c_err_to_py(err) + + if qwp_ws_progress is not None: + c_qwp_ws_progress = QwpWsProgress.parse(qwp_ws_progress).c_value + if not line_sender_opts_qwpws_progress( + self._opts, c_qwp_ws_progress, &err): + raise c_err_to_py(err) + + if qwp_ws_error_handler is not None and not callable(qwp_ws_error_handler): + raise TypeError( + '"qwp_ws_error_handler" must be callable or None, ' + f'not {_fqn(type(qwp_ws_error_handler))}') + if qwp_ws_error_handler is not None and not _is_qwp_ws_protocol(self._c_protocol): + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'qwp_ws_error_handler is only supported for QWP/WebSocket senders.') + if _is_qwp_ws_protocol(self._c_protocol): + if qwp_ws_error_handler is None: + qwp_ws_error_handler = _default_qwp_ws_error_handler + self._qwp_ws_error_handler = qwp_ws_error_handler + if not line_sender_opts_qwpws_error_handler( + self._opts, + _qwp_ws_error_trampoline, + self._qwp_ws_error_handler, + &err): + self._qwp_ws_error_handler = None + raise c_err_to_py(err) + + if username is not None: + str_to_utf8(b, username, &c_username) + if not line_sender_opts_username(self._opts, c_username, &err): + raise c_err_to_py(err) + + if password is not None: + str_to_utf8(b, password, &c_password) + if not line_sender_opts_password(self._opts, c_password, &err): + raise c_err_to_py(err) + + if token is not None: + str_to_utf8(b, token, &c_token) + if not line_sender_opts_token(self._opts, c_token, &err): + raise c_err_to_py(err) + + if token_x is not None: + str_to_utf8(b, token_x, &c_token_x) + if not line_sender_opts_token_x(self._opts, c_token_x, &err): + raise c_err_to_py(err) + + if token_y is not None: + str_to_utf8(b, token_y, &c_token_y) + if not 
line_sender_opts_token_y(self._opts, c_token_y, &err): + raise c_err_to_py(err) + + if protocol_version is not None: + if protocol_version == 'auto': + pass + elif (protocol_version == 1) or (protocol_version == '1'): + if not line_sender_opts_protocol_version( + self._opts, line_sender_protocol_version_1, &err): + raise c_err_to_py(err) + elif (protocol_version == 2) or (protocol_version == '2'): + if not line_sender_opts_protocol_version( + self._opts, line_sender_protocol_version_2, &err): + raise c_err_to_py(err) + elif (protocol_version == 3) or (protocol_version == '3'): + if not line_sender_opts_protocol_version( + self._opts, line_sender_protocol_version_3, &err): + raise c_err_to_py(err) + else: + raise IngressError( + IngressErrorCode.ConfigError, + '"protocol_version" must be None, "auto", 1-3' + + f' not {protocol_version!r}') + + if auth_timeout is not None: + if _is_int_not_bool(auth_timeout): + c_auth_timeout = auth_timeout + elif isinstance(auth_timeout, cp_timedelta): + c_auth_timeout = _timedelta_to_millis(auth_timeout) + else: + raise TypeError( + '"auth_timeout" must be an int or a timedelta, ' + f'not {_fqn(type(auth_timeout))}') + if not line_sender_opts_auth_timeout(self._opts, c_auth_timeout, &err): + raise c_err_to_py(err) + + if tls_verify is not None: + if (tls_verify is True) or (tls_verify == 'on'): + c_tls_verify = True + elif (tls_verify is False) or (tls_verify == 'unsafe_off'): + c_tls_verify = False + else: + raise ValueError( + '"tls_verify" must be a bool, "on" or "unsafe_off", ' + f'not {tls_verify!r}') + if not line_sender_opts_tls_verify(self._opts, c_tls_verify, &err): + raise c_err_to_py(err) + + if tls_roots is not None: + tls_roots = str(tls_roots) + str_to_utf8(b, tls_roots, &c_tls_roots) + if not line_sender_opts_tls_roots(self._opts, c_tls_roots, &err): + raise c_err_to_py(err) + + if tls_roots_password is not None: + str_to_utf8(b, tls_roots_password, &c_tls_roots_password) + if not line_sender_opts_tls_roots_password( 
+ self._opts, c_tls_roots_password, &err): + raise c_err_to_py(err) + + if tls_ca is not None: + c_tls_ca = TlsCa.parse(tls_ca).c_value + if not line_sender_opts_tls_ca(self._opts, c_tls_ca, &err): raise c_err_to_py(err) elif protocol.tls_enabled and tls_roots is None: # Set different default for Python than the Rust default. @@ -2024,7 +5768,7 @@ cdef class Sender: raise c_err_to_py(err) if retry_timeout is not None: - if isinstance(retry_timeout, int): + if _is_int_not_bool(retry_timeout): c_retry_timeout = retry_timeout if not line_sender_opts_retry_timeout(self._opts, c_retry_timeout, &err): raise c_err_to_py(err) @@ -2037,6 +5781,22 @@ cdef class Sender: '"retry_timeout" must be an int or a timedelta, ' f'not {_fqn(type(retry_timeout))}') + if retry_max_backoff is not None: + if _is_int_not_bool(retry_max_backoff): + c_retry_max_backoff = retry_max_backoff + if not line_sender_opts_retry_max_backoff( + self._opts, c_retry_max_backoff, &err): + raise c_err_to_py(err) + elif isinstance(retry_max_backoff, cp_timedelta): + c_retry_max_backoff = _timedelta_to_millis(retry_max_backoff) + if not line_sender_opts_retry_max_backoff( + self._opts, c_retry_max_backoff, &err): + raise c_err_to_py(err) + else: + raise TypeError( + '"retry_max_backoff" must be an int or a timedelta, ' + f'not {_fqn(type(retry_max_backoff))}') + if request_min_throughput is not None: c_request_min_throughput = request_min_throughput if not line_sender_opts_request_min_throughput(self._opts, c_request_min_throughput, &err): @@ -2048,7 +5808,7 @@ cdef class Sender: raise c_err_to_py(err) if request_timeout is not None: - if isinstance(request_timeout, int): + if _is_int_not_bool(request_timeout): c_request_timeout = request_timeout if not line_sender_opts_request_timeout(self._opts, c_request_timeout, &err): raise c_err_to_py(err) @@ -2067,7 +5827,8 @@ cdef class Sender: auto_flush_rows, auto_flush_bytes, auto_flush_interval, - &self._auto_flush_mode) + &self._auto_flush_mode, + 
c_max_datagram_size) self._init_buf_size = init_buf_size or 65536 self._last_flush_ms = calloc(1, sizeof(int64_t)) @@ -2077,6 +5838,7 @@ cdef class Sender: self._opts = NULL self._impl = NULL self._buffer = None + self._qwp_ws_error_handler = None self._auto_flush_mode.enabled = False self._last_flush_ms = NULL self._init_buf_size = 0 @@ -2099,14 +5861,20 @@ cdef class Sender: object tls_verify=None, # default: True object tls_ca=None, # default: TlsCa.WebpkiRoots object tls_roots=None, + str tls_roots_password=None, object max_buf_size=None, # 100 * 1024 * 1024 - 100MiB object retry_timeout=None, # default: 10000 milliseconds + object retry_max_backoff=None, # default: 1000 milliseconds object request_min_throughput=None, # default: 100 * 1024 - 100KiB/s object request_timeout=None, object auto_flush=None, # Default True object auto_flush_rows=None, # Default 75000 (HTTP) or 600 (TCP) object auto_flush_bytes=None, # Default off object auto_flush_interval=None, # Default 1000 milliseconds + object max_datagram_size=None, # Default 1400 for QWP/UDP + object multicast_ttl=None, # Default 1 for QWP/UDP + object qwp_ws_progress=None, # Default background for QWP/WebSocket + object qwp_ws_error_handler=None, object protocol_version=None, # Default auto object init_buf_size=None, # 64KiB object max_name_len=None): # 127 @@ -2143,15 +5911,21 @@ cdef class Sender: tls_verify, tls_ca, tls_roots, + tls_roots_password, max_buf_size, retry_timeout, + retry_max_backoff, request_min_throughput, request_timeout, auto_flush, auto_flush_rows, auto_flush_bytes, auto_flush_interval, + max_datagram_size, + multicast_ttl, protocol_version, + qwp_ws_progress, + qwp_ws_error_handler, init_buf_size, max_name_len) finally: @@ -2171,14 +5945,20 @@ cdef class Sender: object tls_verify=None, # default: True object tls_ca=None, # default: TlsCa.WebpkiRoots object tls_roots=None, + str tls_roots_password=None, object max_buf_size=None, # 100 * 1024 * 1024 - 100MiB object retry_timeout=None, # 
default: 10000 milliseconds + object retry_max_backoff=None, # default: 1000 milliseconds object request_min_throughput=None, # default: 100 * 1024 - 100KiB/s object request_timeout=None, object auto_flush=None, # Default True object auto_flush_rows=None, # Default 75000 (HTTP) or 600 (TCP) object auto_flush_bytes=None, # Default off object auto_flush_interval=None, # Default 1000 milliseconds + object max_datagram_size=None, # Default 1400 for QWP/UDP + object multicast_ttl=None, # Default 1 for QWP/UDP + object qwp_ws_progress=None, # Default background for QWP/WebSocket + object qwp_ws_error_handler=None, object protocol_version=None, # Default auto object init_buf_size=None, # 64KiB object max_name_len=None): # 127 @@ -2208,11 +5988,6 @@ cdef class Sender: IngressErrorCode.ConfigError, 'Missing "addr" parameter in config string') - if 'tls_roots_password' in params: - raise IngressError( - IngressErrorCode.ConfigError, - '"tls_roots_password" is not supported in the conf_str.') - # add fields to the dictionary, so long as they aren't already # present in the params dictionary for override_key, override_value in { @@ -2226,14 +6001,19 @@ cdef class Sender: 'tls_verify': tls_verify, 'tls_ca': tls_ca, 'tls_roots': tls_roots, + 'tls_roots_password': tls_roots_password, 'max_buf_size': max_buf_size, 'retry_timeout': retry_timeout, + 'retry_max_backoff_millis': retry_max_backoff, 'request_min_throughput': request_min_throughput, 'request_timeout': request_timeout, 'auto_flush': auto_flush, 'auto_flush_rows': auto_flush_rows, 'auto_flush_bytes': auto_flush_bytes, 'auto_flush_interval': auto_flush_interval, + 'max_datagram_size': max_datagram_size, + 'multicast_ttl': multicast_ttl, + 'qwp_ws_progress': qwp_ws_progress, 'protocol_version': protocol_version, 'init_buf_size': init_buf_size, 'max_name_len': max_name_len, @@ -2248,11 +6028,48 @@ cdef class Sender: sender = Sender.__new__(Sender) - # Forward only the `addr=` parameter to the C API. 
- synthetic_conf_str = f'{protocol.tag}::addr={addr};' + python_handled_keys = { + 'addr', + 'bind_interface', + 'username', + 'password', + 'token', + 'token_x', + 'token_y', + 'auth_timeout', + 'tls_verify', + 'tls_ca', + 'tls_roots', + 'tls_roots_password', + 'max_buf_size', + 'retry_timeout', + 'retry_max_backoff_millis', + 'request_min_throughput', + 'request_timeout', + 'auto_flush', + 'auto_flush_rows', + 'auto_flush_bytes', + 'auto_flush_interval', + 'max_datagram_size', + 'multicast_ttl', + 'qwp_ws_progress', + 'protocol_version', + 'init_buf_size', + 'max_name_len', + } + synthetic_params = {'addr': addr} + if protocol in (Protocol.QwpWs, Protocol.QwpWss): + for key, value in params.items(): + if key not in python_handled_keys: + synthetic_params[key] = value + synthetic_conf_str = protocol.tag + '::' + ''.join( + f'{key}={conf_str_value(value)};' + for key, value in synthetic_params.items()) str_to_utf8(b, synthetic_conf_str, &c_synthetic_conf_str) sender._opts = line_sender_opts_from_conf( c_synthetic_conf_str, &err) + if sender._opts == NULL: + raise c_err_to_py(err) sender._set_sender_fields( b, @@ -2267,15 +6084,21 @@ cdef class Sender: params.get('tls_verify'), params.get('tls_ca'), params.get('tls_roots'), + params.get('tls_roots_password'), params.get('max_buf_size'), params.get('retry_timeout'), + params.get('retry_max_backoff_millis'), params.get('request_min_throughput'), params.get('request_timeout'), params.get('auto_flush'), params.get('auto_flush_rows'), params.get('auto_flush_bytes'), params.get('auto_flush_interval'), + params.get('max_datagram_size'), + params.get('multicast_ttl'), params.get('protocol_version'), + params.get('qwp_ws_progress'), + qwp_ws_error_handler, params.get('init_buf_size'), params.get('max_name_len')) @@ -2296,14 +6119,20 @@ cdef class Sender: object tls_verify=None, # default: True object tls_ca=None, # default: TlsCa.WebpkiRoots object tls_roots=None, + str tls_roots_password=None, object max_buf_size=None, # 
100 * 1024 * 1024 - 100MiB object retry_timeout=None, # default: 10000 milliseconds + object retry_max_backoff=None, # default: 1000 milliseconds object request_min_throughput=None, # default: 100 * 1024 - 100KiB/s object request_timeout=None, object auto_flush=None, # Default True object auto_flush_rows=None, # Default 75000 (HTTP) or 600 (TCP) object auto_flush_bytes=None, # Default off object auto_flush_interval=None, # Default 1000 milliseconds + object max_datagram_size=None, # Default 1400 for QWP/UDP + object multicast_ttl=None, # Default 1 for QWP/UDP + object qwp_ws_progress=None, # Default background for QWP/WebSocket + object qwp_ws_error_handler=None, object protocol_version=None, # Default auto object init_buf_size=None, # 64KiB object max_name_len=None): # 127 @@ -2336,19 +6165,34 @@ cdef class Sender: tls_verify=tls_verify, tls_ca=tls_ca, tls_roots=tls_roots, + tls_roots_password=tls_roots_password, max_buf_size=max_buf_size, retry_timeout=retry_timeout, + retry_max_backoff=retry_max_backoff, request_min_throughput=request_min_throughput, request_timeout=request_timeout, auto_flush=auto_flush, auto_flush_rows=auto_flush_rows, auto_flush_bytes=auto_flush_bytes, auto_flush_interval=auto_flush_interval, + max_datagram_size=max_datagram_size, + multicast_ttl=multicast_ttl, + qwp_ws_progress=qwp_ws_progress, + qwp_ws_error_handler=qwp_ws_error_handler, protocol_version=protocol_version, init_buf_size=init_buf_size, max_name_len=max_name_len) + cdef inline object _new_buffer_for_sender(self): + cdef Buffer buf = Buffer.__new__(Buffer) + buf._impl = line_sender_buffer_new_for_sender(self._impl) + buf._b = qdb_pystr_buf_new() + reserve_buffer(buf._impl, self._init_buf_size) + buf._init_buf_size = self._init_buf_size + buf._max_name_len = line_sender_get_max_name_len(self._impl) + return buf + def new_buffer(self): """ Make a new configured buffer. 
@@ -2356,10 +6200,15 @@ cdef class Sender: The buffer is set up with the configured `init_buf_size` and `max_name_len`. """ - return Buffer( - protocol_version=self.protocol_version, - init_buf_size=self._init_buf_size, - max_name_len=self.max_name_len) + if self._impl == NULL: + if self._opts == NULL: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'new_buffer() can\'t be called: Sender is closed.') + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'new_buffer() can\'t be called before establish().') + return self._new_buffer_for_sender() @property def init_buf_size(self) -> int: @@ -2433,6 +6282,10 @@ cdef class Sender: raise IngressError( IngressErrorCode.InvalidApiCall, 'protocol_version() can\'t be called: Sender is closed.') + if _is_qwp_udp_protocol(self._c_protocol): + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'protocol_version is not applicable for QWP/UDP senders.') return line_sender_get_protocol_version(self._impl) def establish(self): @@ -2464,10 +6317,7 @@ cdef class Sender: raise c_err_to_py(err) if self._buffer is None: - self._buffer = Buffer( - protocol_version=self.protocol_version, - init_buf_size=self._init_buf_size, - max_name_len=self.max_name_len) + self._buffer = self._new_buffer_for_sender() line_sender_opts_free(self._opts) self._opts = NULL @@ -2503,6 +6353,10 @@ cdef class Sender: The ``bytes`` value returned represents the unsent data. + For QWP/UDP senders this always returns ``b''`` because encoding + is deferred to flush. Use :func:`Sender.__len__` instead for a + size estimate. + Also see :func:`Sender.__len__`. """ if self._buffer is None: @@ -2515,6 +6369,9 @@ cdef class Sender: Number of bytes of unsent data in the internal buffer. Equivalent (but cheaper) to ``len(bytes(sender))``. + + For QWP/UDP senders this returns an estimated size hint, not the + exact serialized byte count. 
""" if self._buffer is None: return 0 @@ -2533,7 +6390,7 @@ cdef class Sender: symbols: Optional[Dict[str, str]]=None, columns: Optional[Dict[ str, - Union[None, bool, int, float, str, TimestampMicros, datetime.datetime, numpy.ndarray]]]=None, + Union[None, bool, int, float, str, TimestampMicros, datetime.datetime, numpy.ndarray, Decimal]]]=None, at: Union[TimestampNanos, datetime.datetime, ServerTimestampType]): """ Write a row to the internal buffer. @@ -2661,6 +6518,12 @@ cdef class Sender: :param buffer: The buffer to flush. If ``None``, the internal buffer is flushed. + With QWP/WebSocket, this publishes the buffer into the local sender + queue and returns before the server necessarily ACKs the frame. Later + terminal diagnostics fail subsequent sender calls and are available as + :attr:`IngressError.qwp_ws_error`. Server diagnostics are also + available through :func:`Sender.poll_qwp_ws_error`. + :param clear: If ``True``, the flushed buffer is cleared (default). If ``False``, the flushed buffer is left in the internal buffer. Note that ``clear=False`` is only supported if ``buffer`` is also @@ -2692,10 +6555,11 @@ cdef class Sender: IngressErrorCode.InvalidApiCall, 'flush() can\'t be called: Sender is closed.') if buffer is not None: + buffer._check_impl() c_buf = buffer._impl else: c_buf = self._buffer._impl - if line_sender_buffer_size(c_buf) == 0: + if line_sender_buffer_size(c_buf) == 0 and not _is_qwp_ws_protocol(self._c_protocol): return # We might be blocking on IO, so temporarily release the GIL. 
@@ -2712,9 +6576,9 @@ cdef class Sender: ok = line_sender_flush(sender, c_buf, &err) else: ok = line_sender_flush_and_keep(sender, c_buf, &err) + _ensure_has_gil(&gs) if ok and c_buf == self._buffer._impl: self._last_flush_ms[0] = line_sender_now_micros() // 1000 - _ensure_has_gil(&gs) if not ok: if c_buf == self._buffer._impl: # Prevent a follow-up call to `.close(flush=True)` (as is @@ -2728,12 +6592,216 @@ cdef class Sender: else: raise c_err_to_py(err) + cdef inline void_int _check_qwp_ws(self, str method) except -1: + if self._impl == NULL: + raise IngressError( + IngressErrorCode.InvalidApiCall, + f'{method}() can\'t be called: Sender is closed.') + if not _is_qwp_ws_protocol(self._c_protocol): + raise IngressError( + IngressErrorCode.InvalidApiCall, + f'{method}() is only supported for QWP/WebSocket senders.') + + def flush_and_get_fsn(self, Buffer buffer=None): + """ + Publish a QWP/WebSocket buffer locally, clear it on success, and return + the assigned frame sequence number. + """ + cdef line_sender* sender = self._impl + cdef line_sender_error* err = NULL + cdef line_sender_buffer* c_buf = NULL + cdef line_sender_qwpws_fsn fsn + cdef PyThreadState* gs = NULL + cdef bint ok = False + + if self._in_txn: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'Cannot flush explicitly inside a transaction') + self._check_qwp_ws('flush_and_get_fsn') + if buffer is not None: + buffer._check_impl() + c_buf = buffer._impl + else: + c_buf = self._buffer._impl + + _ensure_doesnt_have_gil(&gs) + ok = line_sender_qwpws_flush_and_get_fsn(sender, c_buf, &fsn, &err) + _ensure_has_gil(&gs) + if not ok: + raise c_err_to_py(err) + if c_buf == self._buffer._impl: + self._last_flush_ms[0] = line_sender_now_micros() // 1000 + if fsn.has_value: + return fsn.value + return None + + def flush_and_keep_and_get_fsn(self, Buffer buffer=None): + """ + Publish a QWP/WebSocket buffer locally without clearing it and return + the assigned frame sequence number. 
+ """ + cdef line_sender* sender = self._impl + cdef line_sender_error* err = NULL + cdef line_sender_buffer* c_buf = NULL + cdef line_sender_qwpws_fsn fsn + cdef PyThreadState* gs = NULL + cdef bint ok = False + + if self._in_txn: + raise IngressError( + IngressErrorCode.InvalidApiCall, + 'Cannot flush explicitly inside a transaction') + self._check_qwp_ws('flush_and_keep_and_get_fsn') + if buffer is not None: + buffer._check_impl() + c_buf = buffer._impl + else: + c_buf = self._buffer._impl + + _ensure_doesnt_have_gil(&gs) + ok = line_sender_qwpws_flush_and_keep_and_get_fsn( + sender, c_buf, &fsn, &err) + _ensure_has_gil(&gs) + if not ok: + raise c_err_to_py(err) + if c_buf == self._buffer._impl: + self._last_flush_ms[0] = line_sender_now_micros() // 1000 + if fsn.has_value: + return fsn.value + return None + + def published_fsn(self): + """ + Highest QWP/WebSocket frame sequence number published locally. + """ + cdef line_sender_qwpws_fsn fsn + cdef line_sender_error* err = NULL + + self._check_qwp_ws('published_fsn') + if not line_sender_qwpws_published_fsn(self._impl, &fsn, &err): + raise c_err_to_py(err) + if fsn.has_value: + return fsn.value + return None + + def acked_fsn(self): + """ + Highest QWP/WebSocket frame sequence number completed by ACK or + drop-and-continue rejection. + """ + cdef line_sender_qwpws_fsn fsn + cdef line_sender_error* err = NULL + + self._check_qwp_ws('acked_fsn') + if not line_sender_qwpws_acked_fsn(self._impl, &fsn, &err): + raise c_err_to_py(err) + if fsn.has_value: + return fsn.value + return None + + def await_acked_fsn(self, fsn, timeout_millis): + """ + Wait until the QWP/WebSocket completion watermark reaches ``fsn``. 
+ """ + cdef line_sender_error* err = NULL + cdef PyThreadState* gs = NULL + cdef uint64_t c_fsn + cdef uint64_t c_timeout_millis + cdef cbool reached = False + cdef bint ok = False + + self._check_qwp_ws('await_acked_fsn') + if not isinstance(fsn, int) or isinstance(fsn, bool): + raise TypeError('"fsn" must be a non-negative int.') + if fsn < 0: + raise ValueError('"fsn" must be a non-negative int.') + if not isinstance(timeout_millis, int) or isinstance(timeout_millis, bool): + raise TypeError('"timeout_millis" must be a non-negative int.') + if timeout_millis < 0: + raise ValueError('"timeout_millis" must be a non-negative int.') + c_fsn = fsn + c_timeout_millis = timeout_millis + + _ensure_doesnt_have_gil(&gs) + ok = line_sender_qwpws_await_acked_fsn( + self._impl, c_fsn, c_timeout_millis, &reached, &err) + _ensure_has_gil(&gs) + if not ok: + raise c_err_to_py(err) + return bool(reached) + + def drive_once(self): + """ + Drive one QWP/WebSocket progress step for manual progress senders. + """ + cdef line_sender_error* err = NULL + cdef PyThreadState* gs = NULL + cdef cbool progressed = False + cdef bint ok = False + + self._check_qwp_ws('drive_once') + _ensure_doesnt_have_gil(&gs) + ok = line_sender_qwpws_drive_once(self._impl, &progressed, &err) + _ensure_has_gil(&gs) + if not ok: + raise c_err_to_py(err) + return bool(progressed) + + def poll_qwp_ws_error(self): + """ + Poll the next structured QWP/WebSocket diagnostic. 
+ """ + cdef line_sender_error* err = NULL + cdef line_sender_qwpws_error* qwp_err = NULL + cdef line_sender_qwpws_error_view view + + self._check_qwp_ws('poll_qwp_ws_error') + if not line_sender_qwpws_poll_error(self._impl, &qwp_err, &err): + raise c_err_to_py(err) + if qwp_err == NULL: + return None + try: + view = line_sender_qwpws_error_get_view(qwp_err) + return _qwp_ws_error_from_raw(c_qwp_ws_error_view_to_raw(view)) + finally: + line_sender_qwpws_error_free(qwp_err) + + def qwp_ws_errors_dropped(self): + """ + Number of QWP/WebSocket diagnostics dropped from the bounded ring. + """ + cdef line_sender_error* err = NULL + cdef uint64_t dropped = 0 + + self._check_qwp_ws('qwp_ws_errors_dropped') + if not line_sender_qwpws_errors_dropped(self._impl, &dropped, &err): + raise c_err_to_py(err) + return dropped + + def close_drain(self): + """ + Stop accepting new QWP/WebSocket publications and wait for already + published frames to resolve. + """ + cdef line_sender_error* err = NULL + cdef PyThreadState* gs = NULL + cdef bint ok = False + + self._check_qwp_ws('close_drain') + _ensure_doesnt_have_gil(&gs) + ok = line_sender_qwpws_close_drain(self._impl, &err) + _ensure_has_gil(&gs) + if not ok: + raise c_err_to_py(err) + cdef _close(self): self._buffer = None line_sender_opts_free(self._opts) self._opts = NULL line_sender_close(self._impl) self._impl = NULL + self._qwp_ws_error_handler = None if self._slot_id != -1: qdb_active_senders_track_closed(self._slot_id) self._slot_id = -1 @@ -2747,11 +6815,15 @@ cdef class Sender: Once a sender is closed, it can't be re-used. :param bool flush: If ``True``, flush the internal buffer before closing. + For QWP/WebSocket, this also drains already-published frames before + closing. 
""" try: if (flush and (self._impl != NULL) and (not line_sender_must_close(self._impl))): self.flush(None, True) + if _is_qwp_ws_protocol(self._c_protocol): + self.close_drain() finally: self._close() @@ -2769,4 +6841,3 @@ cdef class Sender: def __dealloc__(self): self._close() free(self._last_flush_ms) - diff --git a/src/questdb/line_sender.pxd b/src/questdb/line_sender.pxd index ad364d24..97bdeb6f 100644 --- a/src/questdb/line_sender.pxd +++ b/src/questdb/line_sender.pxd @@ -22,7 +22,13 @@ ## ################################################################################ -from libc.stdint cimport int64_t, uint16_t, uint64_t, uint8_t, uint32_t, int32_t +from libc.stdint cimport int64_t, uint16_t, uint64_t, uint8_t, uint32_t, \ + int32_t, int8_t, int16_t + +from .arrow_c_data_interface cimport ArrowArray, ArrowArrayStream, ArrowSchema + +cdef extern from "stdbool.h": + ctypedef unsigned char cbool "bool" cdef extern from "questdb/ingress/line_sender.h": cdef struct line_sender_error: @@ -42,13 +48,22 @@ cdef extern from "questdb/ingress/line_sender.h": line_sender_error_config_error, line_sender_error_array_error, line_sender_error_protocol_version_error, - line_sender_error_invalid_decimal + line_sender_error_invalid_decimal, + line_sender_error_server_rejection, + line_sender_error_arrow_unsupported_column_kind, + line_sender_error_arrow_ingest, + line_sender_error_failover_retry, + line_sender_error_role_mismatch cdef enum line_sender_protocol: line_sender_protocol_tcp, line_sender_protocol_tcps, line_sender_protocol_http, line_sender_protocol_https, + line_sender_protocol_qwpudp, + line_sender_protocol_qwpws, + line_sender_protocol_qwpwss, + line_sender_protocol_unknown, cdef enum line_sender_protocol_version: line_sender_protocol_version_1 = 1, @@ -61,6 +76,47 @@ cdef extern from "questdb/ingress/line_sender.h": line_sender_ca_webpki_and_os_roots, line_sender_ca_pem_file, + cdef enum line_sender_qwpws_progress: + LINE_SENDER_QWPWS_PROGRESS_BACKGROUND, + 
LINE_SENDER_QWPWS_PROGRESS_MANUAL, + + cdef struct line_sender_qwpws_fsn: + cbool has_value + uint64_t value + + cdef enum line_sender_qwpws_error_category: + LINE_SENDER_QWPWS_ERROR_SCHEMA_MISMATCH, + LINE_SENDER_QWPWS_ERROR_PARSE_ERROR, + LINE_SENDER_QWPWS_ERROR_INTERNAL_ERROR, + LINE_SENDER_QWPWS_ERROR_SECURITY_ERROR, + LINE_SENDER_QWPWS_ERROR_WRITE_ERROR, + LINE_SENDER_QWPWS_ERROR_PROTOCOL_VIOLATION, + LINE_SENDER_QWPWS_ERROR_UNKNOWN, + + cdef enum line_sender_qwpws_error_policy: + LINE_SENDER_QWPWS_ERROR_DROP_AND_CONTINUE, + LINE_SENDER_QWPWS_ERROR_HALT, + + cdef struct line_sender_qwpws_error: + pass + + cdef struct line_sender_qwpws_error_view: + line_sender_qwpws_error_category category + line_sender_qwpws_error_policy applied_policy + cbool has_status + uint8_t status + cbool has_message_sequence + uint64_t message_sequence + uint64_t from_fsn + uint64_t to_fsn + const char* message + size_t message_len + + ctypedef void (*line_sender_qwpws_error_cb)( + void* user_data, + const line_sender_qwpws_error_view* event + ) noexcept with gil + line_sender_error_code line_sender_error_get_code( const line_sender_error* error ) noexcept nogil @@ -138,17 +194,26 @@ cdef extern from "questdb/ingress/line_sender.h": size_t max_name_len ) noexcept nogil + line_sender_buffer* line_sender_buffer_new_qwp( + ) noexcept nogil + + line_sender_buffer* line_sender_buffer_new_qwp_with_max_name_len( + size_t max_name_len + ) noexcept nogil + void line_sender_buffer_free( line_sender_buffer* buffer ) noexcept nogil line_sender_buffer* line_sender_buffer_clone( - const line_sender_buffer* buffer + const line_sender_buffer* buffer, + line_sender_error** err_out ) noexcept nogil - void line_sender_buffer_reserve( + bint line_sender_buffer_reserve( line_sender_buffer* buffer, - size_t additional + size_t additional, + line_sender_error** err_out ) noexcept nogil size_t line_sender_buffer_capacity( @@ -337,6 +402,31 @@ cdef extern from "questdb/ingress/line_sender.h": 
line_sender_error** err_out ) noexcept nogil + bint line_sender_opts_max_datagram_size( + line_sender_opts* opts, + size_t max_datagram_size, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_opts_multicast_ttl( + line_sender_opts* opts, + uint32_t multicast_ttl, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_opts_qwpws_progress( + line_sender_opts* opts, + line_sender_qwpws_progress progress, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_opts_qwpws_error_handler( + line_sender_opts* opts, + line_sender_qwpws_error_cb cb, + void* user_data, + line_sender_error** err_out + ) noexcept nogil + bint line_sender_opts_username( line_sender_opts* opts, line_sender_utf8 username, @@ -397,6 +487,12 @@ cdef extern from "questdb/ingress/line_sender.h": line_sender_error** err_out ) noexcept nogil + bint line_sender_opts_tls_roots_password( + line_sender_opts* opts, + line_sender_utf8 password, + line_sender_error** err_out + ) noexcept nogil + bint line_sender_opts_max_buf_size( line_sender_opts* opts, size_t max_buf_size, @@ -415,6 +511,12 @@ cdef extern from "questdb/ingress/line_sender.h": line_sender_error** err_out ) noexcept nogil + bint line_sender_opts_retry_max_backoff( + line_sender_opts* opts, + uint64_t millis, + line_sender_error** err_out + ) noexcept nogil + bint line_sender_opts_request_min_throughput( line_sender_opts* opts, uint64_t bytes_per_sec, @@ -428,7 +530,7 @@ cdef extern from "questdb/ingress/line_sender.h": ) noexcept nogil line_sender_opts* line_sender_opts_clone( - const line_sender_opts* opts + line_sender_opts* opts ) noexcept nogil void line_sender_opts_free( @@ -453,6 +555,10 @@ cdef extern from "questdb/ingress/line_sender.h": const line_sender * sender ) noexcept nogil + line_sender_protocol line_sender_get_protocol( + const line_sender * sender + ) noexcept nogil + size_t line_sender_get_max_name_len( const line_sender * sender ) noexcept nogil @@ -469,6 +575,76 @@ cdef 
extern from "questdb/ingress/line_sender.h": line_sender* sender ) noexcept nogil + bint line_sender_qwpws_flush_and_get_fsn( + line_sender* sender, + line_sender_buffer* buffer, + line_sender_qwpws_fsn* fsn_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_flush_and_keep_and_get_fsn( + line_sender* sender, + const line_sender_buffer* buffer, + line_sender_qwpws_fsn* fsn_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_drive_once( + line_sender* sender, + cbool* progressed_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_published_fsn( + const line_sender* sender, + line_sender_qwpws_fsn* fsn_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_acked_fsn( + const line_sender* sender, + line_sender_qwpws_fsn* fsn_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_await_acked_fsn( + line_sender* sender, + uint64_t fsn, + uint64_t timeout_millis, + cbool* reached_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_poll_error( + line_sender* sender, + line_sender_qwpws_error** error_out, + line_sender_error** err_out + ) noexcept nogil + + line_sender_qwpws_error_view line_sender_qwpws_error_get_view( + const line_sender_qwpws_error* error + ) noexcept nogil + + bint line_sender_error_qwpws_get_view( + const line_sender_error* error, + line_sender_qwpws_error_view* view_out + ) noexcept nogil + + void line_sender_qwpws_error_free( + line_sender_qwpws_error* error + ) noexcept nogil + + bint line_sender_qwpws_errors_dropped( + const line_sender* sender, + uint64_t* dropped_out, + line_sender_error** err_out + ) noexcept nogil + + bint line_sender_qwpws_close_drain( + line_sender* sender, + line_sender_error** err_out + ) noexcept nogil + bint line_sender_flush( line_sender* sender, line_sender_buffer* buffer, @@ -500,4 +676,602 @@ cdef extern from "questdb/ingress/line_sender.h": 
line_sender_opts* opts, line_sender_utf8 user_agent, line_sender_error** err_out - ) noexcept nogil \ No newline at end of file + ) noexcept nogil + + +cdef extern from "questdb/ingress/column_sender.h": + cdef struct questdb_db: + pass + + cdef struct column_sender: + pass + + cdef struct column_sender_chunk: + pass + + cdef struct column_sender_arrow_import: + pass + + cdef struct column_sender_validity: + const uint8_t* bits + size_t bit_len + + cdef enum column_sender_ack_level: + column_sender_ack_level_ok + column_sender_ack_level_durable + + questdb_db* questdb_db_connect( + const char* conf, + size_t conf_len, + line_sender_error** err_out + ) noexcept nogil + + void questdb_db_close( + questdb_db* db + ) noexcept nogil + + column_sender* questdb_db_borrow_column_sender( + questdb_db* db, + line_sender_error** err_out + ) noexcept nogil + + column_sender* questdb_db_borrow_column_sender_with_retry( + questdb_db* db, + uint64_t budget_ms, + line_sender_error** err_out + ) noexcept nogil + + uint64_t questdb_db_reconnect_max_duration_ms( + const questdb_db* db + ) noexcept nogil + + void questdb_db_return_column_sender( + questdb_db* db, + column_sender* conn + ) noexcept nogil + + void questdb_db_drop_column_sender( + questdb_db* db, + column_sender* conn + ) noexcept nogil + + size_t questdb_db_reap_idle( + questdb_db* db + ) noexcept nogil + + bint column_sender_must_close( + const column_sender* conn + ) noexcept nogil + + column_sender_chunk* column_sender_chunk_new( + const char* table_name, + size_t table_name_len, + line_sender_error** err_out + ) noexcept nogil + + void column_sender_chunk_free( + column_sender_chunk* chunk + ) noexcept nogil + + bint column_sender_chunk_clear( + column_sender_chunk* chunk, + line_sender_error** err_out + ) noexcept nogil + + size_t column_sender_chunk_row_count( + const column_sender_chunk* chunk, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_column_bool( + column_sender_chunk* chunk, 
+ const char* name, + size_t name_len, + const uint8_t* data, + size_t row_count, + const column_sender_validity* validity, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_column_varchar( + column_sender_chunk* chunk, + const char* name, + size_t name_len, + const int32_t* offsets, + const uint8_t* bytes, + size_t bytes_len, + size_t row_count, + const column_sender_validity* validity, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_column_binary( + column_sender_chunk* chunk, + const char* name, + size_t name_len, + const int32_t* offsets, + const uint8_t* bytes, + size_t bytes_len, + size_t row_count, + const column_sender_validity* validity, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_designated_timestamp_micros( + column_sender_chunk* chunk, + const int64_t* data, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_designated_timestamp_nanos( + column_sender_chunk* chunk, + const int64_t* data, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_designated_timestamp_millis( + column_sender_chunk* chunk, + const int64_t* data, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_designated_timestamp_seconds( + column_sender_chunk* chunk, + const int64_t* data, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + cdef enum column_sender_symbol_mode: + column_sender_symbol_mode_auto = 0 + column_sender_symbol_mode_symbol = 1 + column_sender_symbol_mode_not_symbol = 2 + + column_sender_arrow_import* column_sender_arrow_import_new( + ArrowArray* array, + const ArrowSchema* schema, + column_sender_symbol_mode symbol_mode, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_chunk_append_arrow_import( + column_sender_chunk* chunk, + const char* name, + size_t name_len, + const 
column_sender_arrow_import* imported, + size_t row_offset, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + void column_sender_arrow_import_free( + column_sender_arrow_import* imported + ) noexcept nogil + + bint column_sender_chunk_append_arrow_column( + column_sender_chunk* chunk, + const char* name, + size_t name_len, + ArrowArray* array, + const ArrowSchema* schema, + size_t row_offset, + size_t row_count, + line_sender_error** err_out + ) noexcept nogil + + cdef enum column_sender_numpy_dtype: + column_sender_numpy_i8 = 0 + column_sender_numpy_i16 = 1 + column_sender_numpy_i32 = 2 + column_sender_numpy_i64 = 3 + column_sender_numpy_u8 = 4 + column_sender_numpy_u16 = 5 + column_sender_numpy_u32 = 6 + column_sender_numpy_u64 = 7 + column_sender_numpy_f32 = 8 + column_sender_numpy_f64 = 9 + column_sender_numpy_bool = 10 + column_sender_numpy_f16 = 11 + column_sender_numpy_datetime64_s = 12 + column_sender_numpy_datetime64_ms = 13 + column_sender_numpy_datetime64_us = 14 + column_sender_numpy_datetime64_ns = 15 + column_sender_numpy_timedelta64_s = 16 + column_sender_numpy_timedelta64_ms = 17 + column_sender_numpy_timedelta64_us = 18 + column_sender_numpy_timedelta64_ns = 19 + column_sender_numpy_s16 = 20 + column_sender_numpy_s32 = 21 + column_sender_numpy_decimal_s8 = 22 + column_sender_numpy_decimal_s16 = 23 + column_sender_numpy_decimal_s32 = 24 + column_sender_numpy_u32_ipv4 = 25 + column_sender_numpy_u16_char = 26 + column_sender_numpy_geohash_i8 = 27 + column_sender_numpy_geohash_i16 = 28 + column_sender_numpy_geohash_i32 = 29 + column_sender_numpy_geohash_i64 = 30 + column_sender_numpy_f64_ndarray = 31 + column_sender_numpy_datetime64_m = 32 + column_sender_numpy_datetime64_h = 33 + column_sender_numpy_datetime64_D = 34 + column_sender_numpy_datetime64_M = 35 + column_sender_numpy_datetime64_Y = 36 + column_sender_numpy_datetime64_W = 37 + column_sender_numpy_timedelta64_m = 38 + column_sender_numpy_timedelta64_h = 39 + 
column_sender_numpy_timedelta64_D = 40 + column_sender_numpy_timedelta64_M = 41 + column_sender_numpy_timedelta64_Y = 42 + + cdef struct column_sender_numpy_extras: + int8_t decimal_scale + uint8_t geohash_bits + uint8_t array_ndim + const uint32_t* array_shape + + bint column_sender_chunk_append_numpy_column( + column_sender_chunk* chunk, + const char* name, + size_t name_len, + uint32_t dtype, + const uint8_t* data, + size_t data_len_bytes, + size_t row_count, + const column_sender_validity* validity, + const column_sender_numpy_extras* extras, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_flush( + column_sender* conn, + column_sender_chunk* chunk, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_sync( + column_sender* conn, + uint32_t ack_level, + line_sender_error** err_out + ) noexcept nogil + + cdef enum column_sender_arrow_override_kind: + column_sender_arrow_override_symbol = 0 + column_sender_arrow_override_ipv4 = 1 + column_sender_arrow_override_char = 2 + column_sender_arrow_override_geohash = 3 + + cdef struct column_sender_arrow_override: + const char* column + size_t column_len + uint32_t kind + uint32_t arg + + bint column_sender_flush_arrow_batch_server_stamped( + column_sender* conn, + line_sender_table_name table, + ArrowArray* array, + const ArrowSchema* schema, + const column_sender_arrow_override* overrides, + size_t overrides_len, + line_sender_error** err_out + ) noexcept nogil + + bint column_sender_flush_arrow_batch_at_column( + column_sender* conn, + line_sender_table_name table, + ArrowArray* array, + const ArrowSchema* schema, + line_sender_column_name ts_column, + const column_sender_arrow_override* overrides, + size_t overrides_len, + line_sender_error** err_out + ) noexcept nogil + + +cdef extern from "questdb/egress/reader.h": + cdef struct reader: + pass + + cdef struct reader_query: + pass + + cdef struct reader_cursor: + pass + + cdef struct reader_error: + pass + + cdef enum 
reader_error_code: + reader_error_could_not_resolve_addr = 0 + reader_error_config_error = 1 + reader_error_invalid_api_call = 2 + reader_error_socket_error = 3 + reader_error_tls_error = 4 + reader_error_handshake_error = 5 + reader_error_auth_error = 6 + reader_error_unsupported_server = 7 + reader_error_role_mismatch = 8 + reader_error_protocol_error = 9 + reader_error_invalid_utf8 = 10 + reader_error_invalid_bind = 11 + reader_error_server_schema_mismatch = 14 + reader_error_server_parse_error = 15 + reader_error_server_internal_error = 16 + reader_error_server_security_error = 17 + reader_error_limit_exceeded = 18 + reader_error_server_limit_exceeded = 19 + reader_error_cancelled = 20 + reader_error_failover_would_duplicate = 21 + reader_error_schema_drift = 22 + reader_error_no_schema = 23 + reader_error_arrow_export = 24 + + cdef enum reader_arrow_batch_result: + reader_arrow_batch_ok = 0 + reader_arrow_batch_end = 1 + reader_arrow_batch_error = 2 + + reader_error_code reader_error_get_code( + const reader_error* error + ) noexcept nogil + + const char* reader_error_msg( + const reader_error* error, + size_t* len_out + ) noexcept nogil + + void reader_error_free( + reader_error* error + ) noexcept nogil + + reader* reader_from_conf( + line_sender_utf8 config, + reader_error** err_out + ) noexcept nogil + + void reader_close( + reader* reader + ) noexcept nogil + + reader_query* reader_prepare( + reader* reader, + line_sender_utf8 sql, + reader_error** err_out + ) noexcept nogil + + void reader_query_free( + reader_query* query + ) noexcept nogil + + reader_cursor* reader_query_execute( + reader_query** query_inout, + reader_error** err_out + ) noexcept nogil + + reader_cursor* reader_execute( + reader* reader, + line_sender_utf8 sql, + reader_error** err_out + ) noexcept nogil + + cdef struct reader_failover_event: + pass + + ctypedef void (*reader_failover_callback)( + const reader_failover_event* event, + void* user_data) noexcept nogil + + void 
reader_failover_event_failed_host( + const reader_failover_event* event, + const char** out_buf, + size_t* out_len + ) noexcept nogil + + uint16_t reader_failover_event_failed_port( + const reader_failover_event* event + ) noexcept nogil + + void reader_failover_event_new_host( + const reader_failover_event* event, + const char** out_buf, + size_t* out_len + ) noexcept nogil + + uint16_t reader_failover_event_new_port( + const reader_failover_event* event + ) noexcept nogil + + int64_t reader_failover_event_new_request_id( + const reader_failover_event* event + ) noexcept nogil + + uint32_t reader_failover_event_attempts( + const reader_failover_event* event + ) noexcept nogil + + uint64_t reader_failover_event_elapsed_ns( + const reader_failover_event* event + ) noexcept nogil + + reader_error_code reader_failover_event_trigger_code( + const reader_failover_event* event + ) noexcept nogil + + void reader_failover_event_trigger_msg( + const reader_failover_event* event, + const char** out_buf, + size_t* out_len + ) noexcept nogil + + void reader_query_on_failover_reset( + reader_query* query, + reader_failover_callback callback, + void* user_data + ) noexcept nogil + + void reader_cursor_free( + reader_cursor* cursor + ) noexcept nogil + + bint reader_cursor_cancel( + reader_cursor* cursor, + reader_error** err_out + ) noexcept nogil + + reader_arrow_batch_result reader_cursor_next_arrow_batch( + reader_cursor* cursor, + ArrowArray* out_array, + ArrowSchema* out_schema, + reader_error** err_out + ) noexcept nogil + + cdef enum reader_column_kind: + reader_column_kind_boolean = 0x01 + reader_column_kind_byte = 0x02 + reader_column_kind_short = 0x03 + reader_column_kind_int = 0x04 + reader_column_kind_long = 0x05 + reader_column_kind_float = 0x06 + reader_column_kind_double = 0x07 + reader_column_kind_symbol = 0x09 + reader_column_kind_timestamp = 0x0A + reader_column_kind_date = 0x0B + reader_column_kind_uuid = 0x0C + reader_column_kind_long256 = 0x0D + 
reader_column_kind_geohash = 0x0E + reader_column_kind_varchar = 0x0F + reader_column_kind_timestamp_nanos = 0x10 + reader_column_kind_double_array = 0x11 + reader_column_kind_long_array = 0x12 + reader_column_kind_decimal64 = 0x13 + reader_column_kind_decimal128 = 0x14 + reader_column_kind_decimal256 = 0x15 + reader_column_kind_char = 0x16 + reader_column_kind_binary = 0x17 + reader_column_kind_ipv4 = 0x18 + reader_column_kind_unknown = 0xFF + + cdef struct reader_batch: + pass + + cdef struct reader_column_data: + reader_column_kind kind + size_t row_count + const uint8_t* validity + const void* values + size_t value_stride + const uint32_t* var_offsets + const uint8_t* var_data + size_t var_data_len + const uint32_t* symbol_codes + int8_t decimal_scale + uint8_t geohash_precision_bits + + cdef struct reader_array_data: + reader_column_kind kind + size_t row_count + const uint8_t* validity + const uint8_t* data + size_t data_len + const uint32_t* data_offsets + const uint32_t* shapes + size_t shapes_len + const uint32_t* shape_offsets + + cdef struct reader_symbol_entry: + uint32_t offset + uint32_t length + + cdef struct reader_symbol_dict: + size_t entry_count + const uint8_t* heap + size_t heap_len + const reader_symbol_entry* entries + + const reader_batch* reader_cursor_next_batch( + reader_cursor* cursor, + reader_error** err_out + ) noexcept nogil + + size_t reader_batch_row_count( + const reader_batch* batch + ) noexcept nogil + + size_t reader_batch_column_count( + const reader_batch* batch + ) noexcept nogil + + bint reader_batch_column_kind( + const reader_batch* batch, + size_t col_idx, + reader_column_kind* out_kind, + reader_error** err_out + ) noexcept nogil + + bint reader_batch_column_name( + const reader_batch* batch, + size_t col_idx, + const char** out_buf, + size_t* out_len, + reader_error** err_out + ) noexcept nogil + + bint reader_batch_column_data( + const reader_batch* batch, + size_t col_idx, + reader_column_data* out, + reader_error** 
err_out + ) noexcept nogil + + bint reader_batch_array_column_data( + const reader_batch* batch, + size_t col_idx, + reader_array_data* out, + reader_error** err_out + ) noexcept nogil + + bint reader_batch_symbol_dict( + const reader_batch* batch, + reader_symbol_dict* out, + reader_error** err_out + ) noexcept nogil + + bint reader_batch_symbol( + const reader_batch* batch, + size_t col_idx, + uint32_t code, + const char** out_buf, + size_t* out_len, + reader_error** err_out + ) noexcept nogil + + void reader_mark_must_close( + reader* reader + ) noexcept nogil + + # Reader-pool entry points. Same FFI surface as questdb_db_*_column_sender + # but for reader handles. Live here (alongside reader) + # because they wrap/unwrap reader instances; the questdb_db + # opaque is forward-declared from the column_sender extern block + # above. + reader* questdb_db_borrow_reader( + questdb_db* db, + reader_error** err_out + ) noexcept nogil + + void questdb_db_return_reader( + questdb_db* db, + reader* reader + ) noexcept nogil + + size_t questdb_db_dbg_reader_free_count( + questdb_db* db + ) noexcept nogil + + size_t questdb_db_dbg_reader_in_use_count( + questdb_db* db + ) noexcept nogil diff --git a/src/questdb/mpdecimal_compat.pxd b/src/questdb/mpdecimal_compat.pxd index 5e906fe7..2157bb22 100644 --- a/src/questdb/mpdecimal_compat.pxd +++ b/src/questdb/mpdecimal_compat.pxd @@ -1,13 +1,14 @@ -from libc.stdint cimport uint8_t, uint32_t +from libc.stdint cimport uint8_t, uint32_t, uint64_t, int64_t from libc.stddef cimport size_t from cpython.object cimport PyObject from .rpyutils cimport * # Mirror the subset of libmpdec types that CPython embeds in Decimal objects. -ctypedef size_t mpd_uint_t -ctypedef Py_ssize_t mpd_ssize_t - +# Widths are platform-dependent, so the header's conditional typedefs win. 
cdef extern from "mpdecimal_compat.h": + ctypedef uint64_t mpd_uint_t + ctypedef int64_t mpd_ssize_t + ctypedef struct mpd_t: uint8_t flags mpd_ssize_t exp @@ -49,7 +50,11 @@ cdef inline int decimal_pyobj_to_binary( if mpd.exp >= 0: # Decimal ILP does not support negative scales; adjust the unscaled value instead. - exp = mpd.exp + if mpd.exp > 76: + raise ingress_error_cls( + bad_dataframe_code, + f'Decimal exponent {mpd.exp} exceeds the maximum supported value of 76') + exp = mpd.exp scale[0] = 0 else: exp = 0 @@ -59,7 +64,7 @@ cdef inline int decimal_pyobj_to_binary( f'Decimal scale {-mpd.exp} exceeds the maximum supported scale of 76') scale[0] = -mpd.exp - if not qdb_mpd_to_bigendian(digits_ptr, mpd.len, MPD_RADIX, exp, (flag_low & MPD_FLAG_SIGN) != 0, unscaled, &out_size): + if not qdb_mpd_to_bigendian(digits_ptr, mpd.len, MPD_RADIX, exp, (flag_low & MPD_FLAG_SIGN) != 0, unscaled, &out_size): raise ingress_error_cls( bad_dataframe_code, 'Decimal mantissa too large; maximum supported size is 32 bytes.') diff --git a/test/benchmark_pandas_columnar.py b/test/benchmark_pandas_columnar.py new file mode 100644 index 00000000..9ec0019f --- /dev/null +++ b/test/benchmark_pandas_columnar.py @@ -0,0 +1,1530 @@ +#!/usr/bin/env python3 + +import argparse +import gc +import json +import os +import platform +import statistics +import subprocess +import sys +import time +import urllib.parse +import urllib.request + +sys.dont_write_bytecode = True + +import numpy as np +import pandas as pd + +try: + import pyarrow as pa +except ImportError: + pa = None + +import patch_path +import questdb.ingress as qi +from qwp_ws_ack_server import QwpAckServer + + +def _env_int(name, default): + """Read an int knob from the environment, falling back to ``default``. + + Mirrors the Rust column-sender suite knob names (plan s3.3) so a single + environment can drive both clients. 
+ """ + raw = os.environ.get(name) + if raw is None or raw == "": + return default + return int(raw) + + +def git_rev(path): + try: + return subprocess.check_output( + ["git", "rev-parse", "HEAD"], + cwd=path, + text=True, + stderr=subprocess.DEVNULL, + ).strip() + except Exception: + return None + + +def execute_sql(http_base, sql): + if not http_base: + raise ValueError("--real-http is required when SQL hooks are used") + query = urllib.parse.urlencode({"query": sql}) + url = http_base.rstrip("/") + "/exec?" + query + with urllib.request.urlopen(url, timeout=60) as response: + body = response.read().decode("utf-8", errors="replace") + return { + "status": response.status, + "body": body, + } + + +def execute_sqls(http_base, sqls): + return [execute_sql(http_base, sql) for sql in sqls] + + +def strip_conf_keys(conf, keys): + if "::" not in conf: + return conf + prefix, rest = conf.split("::", 1) + kept = [] + for item in rest.split(";"): + if not item: + continue + key = item.split("=", 1)[0] + if key not in keys: + kept.append(item) + return prefix + "::" + "".join(f"{item};" for item in kept) + + +# QuestDB's designated TIMESTAMP is microsecond resolution, so the generated +# datetime64[ns] values must be spaced at least 1 microsecond (1000 ns) apart +# to stay distinct once stored. Nanosecond-spaced timestamps collapse to ~1000 +# distinct microseconds, which DEDUP UPSERT KEYS(ts) then folds to ~1000 rows +# (breaking the count() == rows invariant, plan s3.4). +_TS_STEP_NS = np.int64(1000) + + +def make_timestamp_series(rows): + base = np.int64(1_704_067_200_000_000_000) + values = base + np.arange(rows, dtype=np.int64) * _TS_STEP_NS + return pd.Series(values.view("datetime64[ns]")) + + +# Defaults mirror the Rust column-sender suite (COLUMN_SENDER_PERF.md): the +# headline S1 schema uses a low-cardinality symbol (card 8) and a short +# (~16 byte) varchar so the numbers line up cross-client. 
+DEFAULT_SYM_CARD = 8 +DEFAULT_VARCHAR_LEN = 16 +# S2-wide high-cardinality SYMBOL columns (s1..s5): default matches the Go +# qwp-egress-read-wide anchor (100k distinct/col, uniform). Pass a length-5 +# sequence instead for the plan's 10k-100k spread (dict-scale characterisation). +DEFAULT_HI_SYM_CARD = 100_000 +S2_SPREAD_HI_SYM_CARD = (10_000, 25_000, 50_000, 75_000, 100_000) + + +def _build_note_series(rows, varchar_len, varchar_charset): + """VARCHAR ``note`` column shared by S1/S2 (plan s3.1). + + Fixed-width ~``varchar_len`` notes from a low-cardinality rotating template + (neither the numpy nor the Arrow egress path dedups a plain VARCHAR, so + low-card text flatters neither). ``varchar_charset="unicode"`` shifts every + codepoint into Latin Extended-A (U+0100+): same per-index distinctness and + codepoint count as ascii, but every codepoint is non-ASCII (2 UTF-8 bytes), + so the numpy to_pandas loop (``PyUnicode_FromStringAndSize`` per row) cannot + take CPython's ASCII fast path and must build wider (UCS-2) str objects. + rows/s stays the apples-to-apples metric (unicode is ~2x the on-wire bytes + for the same row/codepoint count). 
+ """ + if pa is None: + raise RuntimeError("pyarrow is not installed") + if varchar_len < 1: + raise ValueError("--varchar-len must be at least 1") + if varchar_charset not in ("ascii", "unicode"): + raise ValueError("varchar_charset must be 'ascii' or 'unicode'") + ascii_templates = [ + (f"note_{index:03}_" * varchar_len)[:varchar_len] + for index in range(min(rows, 1024) or 1)] + if varchar_charset == "unicode": + note_templates = [ + "".join(chr(ord(ch) + 0x100) for ch in tmpl) + for tmpl in ascii_templates] + else: + note_templates = ascii_templates + notes = [note_templates[index % len(note_templates)] + for index in range(rows)] + return pd.Series( + pa.array(notes, type=pa.string()), dtype=pd.ArrowDtype(pa.string())) + + +def make_s1_narrow(rows, *, sym_card=DEFAULT_SYM_CARD, + varchar_len=DEFAULT_VARCHAR_LEN, + varchar_charset="ascii"): + """S1 headline schema (QWP_DATAFRAME_BENCH_PLAN.md s3.1). + + 5 columns matching the Go/Rust ``qwp-egress-read`` narrow schema so the + cross-client parity table lines up: + + * ``ts`` -> TIMESTAMP (designated), ``datetime64[ns]``, monotonic-unique + * ``id`` -> LONG, ``int64`` + * ``price`` -> DOUBLE, ``float64`` + * ``sym`` -> SYMBOL, pandas ``Categorical`` (cardinality ``sym_card``) + * ``note`` -> VARCHAR, Arrow-backed string of length ~``varchar_len`` + (``varchar_charset="ascii"`` default; ``"unicode"`` for + non-ASCII content that defeats the numpy ASCII fast path) + + ``ts`` is monotonic and unique *at microsecond resolution* (the designated + TIMESTAMP precision), so the DEDUP ``UPSERT KEYS(ts)`` table can assert + ``count() == rows`` even though QWP/WS is at-least-once on reconnect + (see plan s3.4 and ``make_timestamp_series``). 
+ """ + if pa is None: + raise RuntimeError("pyarrow is not installed") + if sym_card < 1: + raise ValueError("--sym-card must be at least 1") + indexes = np.arange(rows, dtype=np.int64) + symbols = np.array([f"sym_{index:04}" for index in range(sym_card)]) + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "id": pd.Series(indexes, dtype=np.int64), + "price": pd.Series(indexes.astype(np.float64) * 0.25), + "sym": pd.Categorical(symbols[indexes % len(symbols)]), + "note": _build_note_series(rows, varchar_len, varchar_charset), + }) + + +def make_s2_wide(rows, *, sym_card=DEFAULT_SYM_CARD, + varchar_len=DEFAULT_VARCHAR_LEN, varchar_charset="ascii", + hi_sym_card=DEFAULT_HI_SYM_CARD): + """S2 wide schema (QWP_DATAFRAME_BENCH_PLAN.md s8), matching the Go + ``qwp-egress-read-wide`` anchor so the wide parity number lines up: it is + S1-narrow plus 5 DOUBLE and 5 high-cardinality SYMBOL columns (15 total). + + * ``ts``/``id``/``price``/``sym``/``note`` -> identical to S1-narrow + (``sym`` stays low-cardinality, ``card sym_card``) + * ``d1``..``d5`` -> DOUBLE, ``float64`` (widen the fixed-width payload) + * ``s1``..``s5`` -> SYMBOL, pandas ``Categorical``, high cardinality + + ``hi_sym_card`` sets the cardinality of ``s1``..``s5``: an int applies + uniformly (default 100k, the anchor) or a length-5 sequence gives a spread + (``S2_SPREAD_HI_SYM_CARD`` = 10k-100k). The 5 high-card SYMBOLs are the + connection-scoped delta-dict stress (plan s3.5); the extra DOUBLEs plus the + wider row are the "QWP wins on wide rows" axis. 
+ """ + if pa is None: + raise RuntimeError("pyarrow is not installed") + if sym_card < 1: + raise ValueError("--sym-card must be at least 1") + cards = ([int(hi_sym_card)] * 5 if isinstance(hi_sym_card, int) + else [int(c) for c in hi_sym_card]) + if len(cards) != 5 or any(c < 1 for c in cards): + raise ValueError( + "hi_sym_card must be a positive int or 5 positive ints") + indexes = np.arange(rows, dtype=np.int64) + symbols = np.array([f"sym_{index:04}" for index in range(sym_card)]) + cols = { + "ts": make_timestamp_series(rows), + "id": pd.Series(indexes, dtype=np.int64), + "price": pd.Series(indexes.astype(np.float64) * 0.25), + "sym": pd.Categorical(symbols[indexes % len(symbols)]), + "note": _build_note_series(rows, varchar_len, varchar_charset), + } + for d in range(1, 6): + cols[f"d{d}"] = pd.Series(indexes.astype(np.float64) * (0.5 + d)) + # from_codes avoids materialising rows*5 symbol strings: codes = index mod + # card, categories built once per column. + for i, card in enumerate(cards): + codes = (indexes % card).astype(np.int32) + cats = pd.Index([f"s{i}_{v:06d}" for v in range(card)], dtype="object") + cols[f"s{i + 1}"] = pd.Categorical.from_codes(codes, categories=cats) + return pd.DataFrame(cols) + + +def make_numeric_core(rows): + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "seq": pd.Series(np.arange(rows, dtype=np.int64)), + "price": pd.Series(np.arange(rows, dtype=np.float64) * 0.25), + "qty": pd.Series((np.arange(rows, dtype=np.int64) % 1_000) + 1), + }) + + +def make_numeric_wide(rows): + data = {"ts": make_timestamp_series(rows)} + base_i = np.arange(rows, dtype=np.int64) + base_f = np.arange(rows, dtype=np.float64) + for index in range(8): + data[f"i{index:02}"] = pd.Series(base_i + index) + for index in range(8): + data[f"f{index:02}"] = pd.Series(base_f * 0.25 + index) + return pd.DataFrame(data) + + +def make_categorical_symbols(rows): + symbols = np.array([f"sym_{index:04}" for index in range(1000)]) + venues = 
np.array([f"venue_{index:02}" for index in range(16)]) + indexes = np.arange(rows, dtype=np.int64) + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "symbol": pd.Categorical(symbols[indexes % len(symbols)]), + "venue": pd.Categorical(venues[indexes % len(venues)]), + "price": pd.Series(indexes.astype(np.float64) * 0.25), + "qty": pd.Series((indexes % 1_000) + 1, dtype=np.int64), + }) + + +def make_arrow_strings(rows): + if pa is None: + raise RuntimeError("pyarrow is not installed") + indexes = np.arange(rows, dtype=np.int64) + messages = [f"message_{index % 1024:04}" for index in range(rows)] + payloads = [ + f"payload_{index % 1024:04}_{index % 31:02}_{index % 127:03}" + for index in range(rows)] + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "id": pd.Series(indexes, dtype=np.int64), + "message": pd.Series( + pa.array(messages, type=pa.string()), + dtype=pd.ArrowDtype(pa.string())), + "payload": pd.Series( + pa.array(payloads, type=pa.string()), + dtype=pd.ArrowDtype(pa.string())), + }) + + +def make_arrow_large_strings(rows): + if pa is None: + raise RuntimeError("pyarrow is not installed") + values = [f"label_{index % 1024:04}" for index in range(rows)] + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "label": pd.Series( + pa.array(values, type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())), + "seq": pd.Series(np.arange(rows, dtype=np.int64)), + "price": pd.Series(np.arange(rows, dtype=np.float64) * 0.25), + }) + + +def make_mixed_physical(rows): + if pa is None: + raise RuntimeError("pyarrow is not installed") + symbols = np.array([f"sym_{index:04}" for index in range(1000)]) + venues = np.array([f"venue_{index:02}" for index in range(16)]) + indexes = np.arange(rows, dtype=np.int64) + notes = [f"note_{index % 1024:04}_{index % 31:02}" for index in range(rows)] + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "seq": pd.Series(indexes, dtype=np.int64), + "price": 
pd.Series(indexes.astype(np.float64) * 0.25), + "qty": pd.Series((indexes % 1_000) + 1, dtype=np.int64), + "symbol": pd.Categorical(symbols[indexes % len(symbols)]), + "venue": pd.Categorical(venues[indexes % len(venues)]), + "note": pd.Series( + pa.array(notes, type=pa.string()), + dtype=pd.ArrowDtype(pa.string())), + }) + + +def make_nullable_extension(rows): + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "seq": pd.Series(np.arange(rows, dtype=np.int64), dtype="Int64"), + "price": pd.Series( + np.arange(rows, dtype=np.float64) * 0.25, + dtype="Float64"), + "active": pd.Series( + np.arange(rows, dtype=np.int64) % 2 == 0, + dtype="boolean"), + }) + + +def make_bool_unsigned_decision(rows): + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "active": pd.Series(np.arange(rows, dtype=np.int64) % 2 == 0), + "u8": pd.Series(np.arange(rows, dtype=np.uint8)), + "u16": pd.Series(np.arange(rows, dtype=np.uint16)), + "u32": pd.Series(np.arange(rows, dtype=np.uint32)), + "u64": pd.Series(np.arange(rows, dtype=np.uint64)), + }) + + +def make_unsupported_object(rows): + return pd.DataFrame({ + "ts": make_timestamp_series(rows), + "name": pd.Series( + [f"name_{index % 1024}" for index in range(rows)], + dtype=object), + "qty": pd.Series( + [int(index % 1000) for index in range(rows)], + dtype=object), + "price": pd.Series( + [float(index) * 0.25 for index in range(rows)], + dtype=object), + }) + + +SUPPORTED_SCHEMAS = { + "arrow-large-strings": make_arrow_large_strings, + "arrow-strings": make_arrow_strings, + "categorical-symbols": make_categorical_symbols, + "mixed-physical": make_mixed_physical, + "numeric-core": make_numeric_core, + "numeric-wide": make_numeric_wide, + "s1-narrow": make_s1_narrow, + "s2-wide": make_s2_wide, +} + +# Schemas whose generator accepts the --sym-card / --varchar-len knobs +# (s2-wide additionally accepts --hi-sym-card). 
+KNOB_SCHEMAS = frozenset({"s1-narrow", "s2-wide"}) + + +def build_schema_df(schema_name, rows, *, sym_card=DEFAULT_SYM_CARD, + varchar_len=DEFAULT_VARCHAR_LEN, + varchar_charset="ascii", + hi_sym_card=DEFAULT_HI_SYM_CARD): + """Build a benchmark DataFrame, threading knobs to schemas that accept them. + + Most generators take only ``rows``; the S1/S2 schemas additionally accept + ``sym_card`` / ``varchar_len`` / ``varchar_charset`` (and ``hi_sym_card`` + for s2-wide's high-cardinality s1..s5). Keeping the registry uniform lets + every call site build any schema without special-casing. + """ + generator = SCHEMAS[schema_name] + if schema_name in KNOB_SCHEMAS: + kwargs = dict(sym_card=sym_card, varchar_len=varchar_len, + varchar_charset=varchar_charset) + if schema_name == "s2-wide": + kwargs["hi_sym_card"] = hi_sym_card + return generator(rows, **kwargs) + return generator(rows) + +REJECTION_SCHEMAS = { + "bool-unsigned-decision": make_bool_unsigned_decision, + "nullable-extension": make_nullable_extension, + "unsupported-object": make_unsupported_object, +} + +SCHEMAS = dict(SUPPORTED_SCHEMAS) +SCHEMAS.update(REJECTION_SCHEMAS) + + +SCHEMA_CREATE_SQL = { + "arrow-large-strings": """ +CREATE TABLE {table} ( + label VARCHAR, + seq LONG, + price DOUBLE, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "arrow-strings": """ +CREATE TABLE {table} ( + id LONG, + message VARCHAR, + payload VARCHAR, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "bool-unsigned-decision": """ +CREATE TABLE {table} ( + active BOOLEAN, + u8 LONG, + u16 LONG, + u32 LONG, + u64 LONG, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "categorical-symbols": """ +CREATE TABLE {table} ( + symbol SYMBOL, + venue SYMBOL, + price DOUBLE, + qty LONG, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "mixed-physical": """ +CREATE TABLE {table} ( + seq LONG, + price DOUBLE, + qty LONG, + symbol SYMBOL, + venue SYMBOL, + note VARCHAR, + ts TIMESTAMP +) 
TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "nullable-extension": """ +CREATE TABLE {table} ( + seq LONG, + price DOUBLE, + active BOOLEAN, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "numeric-core": """ +CREATE TABLE {table} ( + seq LONG, + price DOUBLE, + qty LONG, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + "numeric-wide": """ +CREATE TABLE {table} ( + i00 LONG, i01 LONG, i02 LONG, i03 LONG, + i04 LONG, i05 LONG, i06 LONG, i07 LONG, + f00 DOUBLE, f01 DOUBLE, f02 DOUBLE, f03 DOUBLE, + f04 DOUBLE, f05 DOUBLE, f06 DOUBLE, f07 DOUBLE, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", + # Headline S1 schema. DEDUP UPSERT KEYS(ts) + monotonic-unique ts keeps + # count() == rows even though QWP/WS replays frames on reconnect + # (at-least-once inflates 5-16%; see plan s3.4). + "s1-narrow": """ +CREATE TABLE {table} ( + id LONG, + price DOUBLE, + sym SYMBOL, + note VARCHAR, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY HOUR WAL DEDUP UPSERT KEYS(ts) +""", + # S2 wide schema (plan s8), matching the Go qwp-egress-read-wide anchor: + # S1-narrow + 5 DOUBLE + 5 high-cardinality SYMBOL (CAPACITY 200000 fits the + # 100k distinct values/col with slack). DEDUP added (harness requirement, + # plan s3.4) on top of the anchor's column layout. 
+ "s2-wide": """ +CREATE TABLE {table} ( + id LONG, + price DOUBLE, + sym SYMBOL, + note VARCHAR, + d1 DOUBLE, d2 DOUBLE, d3 DOUBLE, d4 DOUBLE, d5 DOUBLE, + s1 SYMBOL CAPACITY 200000, s2 SYMBOL CAPACITY 200000, + s3 SYMBOL CAPACITY 200000, s4 SYMBOL CAPACITY 200000, + s5 SYMBOL CAPACITY 200000, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY HOUR WAL DEDUP UPSERT KEYS(ts) +""", + "unsupported-object": """ +CREATE TABLE {table} ( + name VARCHAR, + qty LONG, + price DOUBLE, + ts TIMESTAMP +) TIMESTAMP(ts) PARTITION BY DAY WAL +""", +} + + +def percentile(sorted_values, pct): + if not sorted_values: + return None + index = int(round((len(sorted_values) - 1) * pct)) + return sorted_values[index] + + +def summarize(samples_ns): + samples = [sample / 1_000_000_000 for sample in samples_ns] + samples_sorted = sorted(samples) + mean = statistics.fmean(samples) + stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0 + return { + "iterations": len(samples), + "median_s": statistics.median(samples), + "mean_s": mean, + "min_s": samples_sorted[0], + "max_s": samples_sorted[-1], + "p95_s": percentile(samples_sorted, 0.95), + "stdev_s": stdev, + "cov": stdev / mean if mean else 0.0, + } + + +def timed_call(fn): + gc.collect() + was_enabled = gc.isenabled() + gc.disable() + try: + cpu_start = time.process_time_ns() + start = time.perf_counter_ns() + result = fn() + end = time.perf_counter_ns() + cpu_end = time.process_time_ns() + finally: + if was_enabled: + gc.enable() + return end - start, cpu_end - cpu_start, result + + +def run_row_path(df, rows, iterations, warmups): + buf = qi.Buffer.qwp() + + def once(): + buf.clear() + buf.dataframe(df, table_name="bench_numeric", at="ts") + return {"encoded_bytes": len(buf)} + + for _ in range(warmups): + once() + + samples = [] + cpu_samples = [] + last = None + for _ in range(iterations): + elapsed, cpu_elapsed, last = timed_call(once) + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + return samples, cpu_samples, last 
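Reviewer's aside, not part of the patch: the `percentile` and `summarize` helpers above can be exercised on their own. The sketch below re-implements `percentile` and a reduced `summarize` (fewer keys than the diff's version) so it runs without the QuestDB client. The timing values in `SAMPLES_NS` are made up for illustration.

```python
import statistics


def percentile(sorted_values, pct):
    # Nearest-rank lookup on an already-sorted list, mirroring the diff.
    if not sorted_values:
        return None
    index = int(round((len(sorted_values) - 1) * pct))
    return sorted_values[index]


def summarize(samples_ns):
    # Reduced copy of the diff's summarize(): nanoseconds in, seconds out.
    samples = [sample / 1_000_000_000 for sample in samples_ns]
    samples_sorted = sorted(samples)
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "iterations": len(samples),
        "median_s": statistics.median(samples),
        "min_s": samples_sorted[0],
        "max_s": samples_sorted[-1],
        "p95_s": percentile(samples_sorted, 0.95),
        "cov": stdev / mean if mean else 0.0,
    }


# Hypothetical timings: four fast calls and one slow outlier.
SAMPLES_NS = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 100_000_000]
summary = summarize(SAMPLES_NS)
```

With only five samples the nearest-rank index rounds to the last element, so `p95_s` coincides with `max_s`. At the default `--iterations 20` the index is 18 of 19, which is the second-largest sample. Small iteration counts therefore make `p95_s` very sensitive to a single outlier.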
+ + +def _make_ack_conf(server): + return ( + f"qwpws::addr=127.0.0.1:{server.port};" + "pool_size=1;" + "pool_max=1;" + "pool_reap=manual;") + + +def _finish_columnar_io_stats(timed_calls): + stats = dict(qi._debug_dataframe_columnar_io_stats(enabled=False)) + if timed_calls: + stats["flush_s_per_call"] = stats["flush_s"] / timed_calls + stats["sync_s_per_call"] = stats["sync_s"] / timed_calls + stats["flush_calls_per_call"] = ( + stats["flush_calls"] / timed_calls) + stats["sync_calls_per_call"] = stats["sync_calls"] / timed_calls + else: + stats["flush_s_per_call"] = None + stats["sync_s_per_call"] = None + stats["flush_calls_per_call"] = None + stats["sync_calls_per_call"] = None + return stats + + +def run_client_ack( + df, + rows, + iterations, + warmups, + *, + min_calls=0, + max_seconds=None, + ack_delay_s=0.0): + samples = [] + cpu_samples = [] + last = None + with QwpAckServer(ack_delay_s=ack_delay_s) as server: + conf = _make_ack_conf(server) + with qi.Client.from_conf(conf) as client: + qi._debug_dataframe_columnar_io_stats(enabled=False, reset=True) + for _ in range(warmups): + client.dataframe(df, table_name="bench_numeric", at="ts") + + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + start = time.perf_counter() + for _ in range(iterations): + elapsed, cpu_elapsed, _ = timed_call( + lambda: client.dataframe( + df, + table_name="bench_numeric", + at="ts")) + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + total_s = time.perf_counter() - start + finally: + columnar_io_stats = _finish_columnar_io_stats(iterations) + + stats = server.snapshot() + reconnects_after_first = max(0, stats["accepted_connections"] - 1) + if reconnects_after_first: + raise AssertionError( + "pooled Client opened extra physical connections: " + f"{stats['accepted_connections']} accepts") + if stats["errors"]: + raise AssertionError( + "ACK server observed errors: " + "; ".join(stats["errors"])) + if min_calls and iterations < min_calls: + raise 
AssertionError( + f"client-ack-reuse requires at least {min_calls} timed calls, " + f"got {iterations}") + if max_seconds is not None and iterations >= min_calls: + if total_s > max_seconds: + raise AssertionError( + f"{iterations} Client.dataframe calls took " + f"{total_s:.3f}s, over {max_seconds:.3f}s") + last = { + "ack_server": stats, + "ack_delay_s": ack_delay_s, + "columnar_io_stats": columnar_io_stats, + "pool_conf": conf, + "reconnects_after_first": reconnects_after_first, + "timed_calls": iterations, + "total_calls": iterations + warmups, + "timed_total_s": total_s, + "rows_ingested": rows * iterations, + } + return samples, cpu_samples, last + + +def run_cold_warm_split(df, rows, warm_iters, *, ack_delay_s=0.0): + """Measure the cold first-flush vs warm steady-state on one connection. + + The cold/warm axis is the symbol delta-dict + commit mode (plan s3.5): + the first frame on a fresh connection sends the full symbol dict from id 0 + with an immediate commit (warming the server cache); later frames on the + same pooled connection send deltas with deferred commit. ``first_frame_sent`` + travels with the pool slot, so this runs with **zero warmups** to capture + the genuine cold flush, then ``warm_iters`` warm flushes on the same slot. + """ + cold_sample = None + cold_cpu = None + warm_samples = [] + warm_cpu = [] + with QwpAckServer(ack_delay_s=ack_delay_s) as server: + conf = _make_ack_conf(server) + with qi.Client.from_conf(conf) as client: + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + + def once(): + return client.dataframe( + df, table_name="bench_numeric", at="ts") + + # Cold: the very first flush on a fresh pooled connection. + cold_sample, cold_cpu, _ = timed_call(once) + # Warm: subsequent flushes reuse the same connection / symbol cache. 
+ for _ in range(warm_iters): + elapsed, cpu_elapsed, _ = timed_call(once) + warm_samples.append(elapsed) + warm_cpu.append(cpu_elapsed) + columnar_io_stats = _finish_columnar_io_stats(1 + warm_iters) + + stats = server.snapshot() + reconnects_after_first = max(0, stats["accepted_connections"] - 1) + if reconnects_after_first: + raise AssertionError( + "cold/warm split opened extra physical connections: " + f"{stats['accepted_connections']} accepts (warm flushes must " + "reuse the cold connection)") + if stats["errors"]: + raise AssertionError( + "ACK server observed errors: " + "; ".join(stats["errors"])) + last = { + "ack_server": stats, + "ack_delay_s": ack_delay_s, + "columnar_io_stats": columnar_io_stats, + "pool_conf": conf, + "warm_iters": warm_iters, + "timed_calls": 1 + warm_iters, + "rows_ingested": rows * (1 + warm_iters), + } + return cold_sample, cold_cpu, warm_samples, warm_cpu, last + + +def run_columnar_populate( + df, rows, iterations, warmups, max_rows_per_chunk=None): + def once(): + kwargs = {} + if max_rows_per_chunk is not None: + kwargs["max_rows_per_chunk"] = max_rows_per_chunk + return qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name="bench_numeric", + at="ts", + **kwargs) + + for _ in range(warmups): + result = once() + if result["row_path_cell_emissions"] != 0: + raise AssertionError( + "columnar benchmark emitted row-path cells during warmup") + + samples = [] + cpu_samples = [] + last = None + for _ in range(iterations): + elapsed, cpu_elapsed, last = timed_call(once) + if last["row_path_cell_emissions"] != 0: + raise AssertionError( + "columnar benchmark emitted row-path cells during timed run") + if last["populated_rows_total"] != rows: + raise AssertionError( + f"expected {rows} populated rows, got " + f"{last['populated_rows_total']}") + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + return samples, cpu_samples, last + + +def run_arrow_materialize(df, rows, iterations, warmups): + if pa is None: + raise RuntimeError("pyarrow is 
not installed") + + def once(): + table = pa.Table.from_pandas(df, preserve_index=False) + return { + "arrow_rows": table.num_rows, + "arrow_columns": table.num_columns, + "arrow_bytes": table.nbytes, + } + + for _ in range(warmups): + once() + + samples = [] + cpu_samples = [] + last = None + for _ in range(iterations): + elapsed, cpu_elapsed, last = timed_call(once) + if last["arrow_rows"] != rows: + raise AssertionError( + f"expected {rows} Arrow rows, got {last['arrow_rows']}") + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + return samples, cpu_samples, last + + +def run_real_row_path( + df, + rows, + iterations, + warmups, + *, + conf, + table_name, + http_base=None, + setup_sqls=(), + reset_sqls=(), + await_ack_ms=30000): + row_conf = strip_conf_keys(conf, {"pool_size", "pool_max", "pool_reap"}) + setup_results = execute_sqls(http_base, setup_sqls) + reset_count = 0 + samples = [] + cpu_samples = [] + last = None + + def reset(): + nonlocal reset_count + execute_sqls(http_base, reset_sqls) + reset_count += len(reset_sqls) + + with qi.Sender.from_conf(row_conf, auto_flush=False) as sender: + def once(): + sender.dataframe(df, table_name=table_name, at="ts") + fsn = sender.flush_and_get_fsn() + acked = True + if fsn is not None: + acked = sender.await_acked_fsn(fsn, await_ack_ms) + if not acked: + raise TimeoutError( + f"QWP/WebSocket ACK timeout waiting for FSN {fsn}") + return { + "acked": acked, + "flushes": 1 if fsn is not None else 0, + "fsn": fsn, + "table_name": table_name, + } + + for _ in range(warmups): + reset() + once() + + for _ in range(iterations): + reset() + elapsed, cpu_elapsed, last = timed_call(once) + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + + if last is None: + last = {} + last.update({ + "await_ack_ms": await_ack_ms, + "conf": row_conf, + "path": "real-row", + "reset_sql_count": reset_count, + "rows_ingested": rows * iterations, + "setup_sql_count": len(setup_sqls), + "setup_sql_results": setup_results, + 
"total_calls": iterations + warmups, + }) + return samples, cpu_samples, last + + +def run_real_client_path( + df, + rows, + iterations, + warmups, + *, + conf, + table_name, + http_base=None, + setup_sqls=(), + reset_sqls=()): + setup_results = execute_sqls(http_base, setup_sqls) + reset_count = 0 + samples = [] + cpu_samples = [] + last = None + + def reset(): + nonlocal reset_count + execute_sqls(http_base, reset_sqls) + reset_count += len(reset_sqls) + + with qi.Client.from_conf(conf) as client: + def once(): + client.dataframe(df, table_name=table_name, at="ts") + return { + "table_name": table_name, + } + + qi._debug_dataframe_columnar_io_stats(enabled=False, reset=True) + for _ in range(warmups): + reset() + once() + + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + for _ in range(iterations): + reset() + elapsed, cpu_elapsed, last = timed_call(once) + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + finally: + columnar_io_stats = _finish_columnar_io_stats(iterations) + + if last is None: + last = {} + last.update({ + "columnar_io_stats": columnar_io_stats, + "conf": conf, + "path": "real-client", + "reset_sql_count": reset_count, + "rows_ingested": rows * iterations, + "setup_sql_count": len(setup_sqls), + "setup_sql_results": setup_results, + "total_calls": iterations + warmups, + }) + return samples, cpu_samples, last + + +def _exception_report(exc): + return { + "type": type(exc).__name__, + "message": str(exc), + "column_failures": list(getattr(exc, "column_failures", ())), + } + + +def _bench_table_name(schema_name): + return f"bench_{schema_name.replace('-', '_')}" + + +def schema_sql_report(schema_name): + table_name = _bench_table_name(schema_name) + return { + "schema": schema_name, + "table_name": table_name, + "drop_sql": f"DROP TABLE IF EXISTS {table_name}", + "create_sql": ( + SCHEMA_CREATE_SQL[schema_name] + .strip() + .format(table=table_name)), + "truncate_sql": f"TRUNCATE TABLE {table_name}", + } + + +def 
columnar_support_report(schema_name, rows, max_rows_per_chunk=None, + *, sym_card=DEFAULT_SYM_CARD, + varchar_len=DEFAULT_VARCHAR_LEN): + df = build_schema_df( + schema_name, rows, sym_card=sym_card, varchar_len=varchar_len) + table_name = _bench_table_name(schema_name) + plan = qi._debug_dataframe_columnar_plan( + df, + table_name=table_name, + at="ts") + report = { + "schema": schema_name, + "rows": rows, + "columns": len(df.columns), + "dtypes": {name: str(dtype) for name, dtype in df.dtypes.items()}, + "columnar_plan": plan, + } + if plan["supported"]: + kwargs = {} + if max_rows_per_chunk is not None: + kwargs["max_rows_per_chunk"] = max_rows_per_chunk + chunk_plan = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name=table_name, + at="ts", + **kwargs) + report["chunk_plan"] = chunk_plan + report["fast_path_assertion"] = { + "row_path_cell_emissions": chunk_plan["row_path_cell_emissions"], + "passed": chunk_plan["row_path_cell_emissions"] == 0, + } + else: + with QwpAckServer() as server: + try: + with qi.Client.from_conf(_make_ack_conf(server)) as client: + client.dataframe(df, table_name=table_name, at="ts") + except qi.UnsupportedDataFrameShapeError as exc: + report["client_rejection"] = _exception_report(exc) + stats = server.snapshot() + report["rejection_publication_check"] = { + "accepted_connections": stats["accepted_connections"], + "binary_frames": stats["binary_frames"], + "qwp1_frames": stats["qwp1_frames"], + "binary_bytes": stats["binary_bytes"], + "errors": stats["errors"], + "passed": ( + stats["binary_frames"] == 0 and + stats["qwp1_frames"] == 0 and + not stats["errors"] and + report.get("client_rejection", {}).get("type") == + "UnsupportedDataFrameShapeError"), + } + return report + + +PATHS = { + "row": run_row_path, + "client-ack": run_client_ack, + "client-ack-reuse": run_client_ack, + "columnar-populate": run_columnar_populate, + "arrow-materialize": run_arrow_materialize, + "real-client": run_real_client_path, + 
"real-row": run_real_row_path, +} + +# JSON contract v1 (plan s3.2): each path is tagged with a phase. "floor" is a +# no-network measurement (populate/encode/materialize); "e2e" includes the +# round trip to a (mock or real) server. +PATH_PHASE = { + "row": "floor", + "columnar-populate": "floor", + "arrow-materialize": "floor", + "client-ack": "e2e", + "client-ack-reuse": "e2e", + "real-client": "e2e", + "real-row": "e2e", +} + + +_MIB = 1024.0 * 1024.0 + + +def add_rates(summary, rows, columns, wire_bytes=None): + median = summary["median_s"] + summary["rows_per_s_median"] = rows / median if median else None + summary["cells_per_s_median"] = rows * columns / median if median else None + # mib_per_s is only meaningful when bytes actually crossed the wire; the + # no-network floor paths leave wire_bytes None (plan s3.2). + if wire_bytes and median: + summary["mib_per_s"] = (wire_bytes / _MIB) / median + else: + summary["mib_per_s"] = None + + +def add_cpu_summary(summary, cpu_samples, rows, columns, wire_bytes=None): + cpu_summary = summarize(cpu_samples) + add_rates(cpu_summary, rows, columns, wire_bytes) + summary["process_cpu"] = cpu_summary + + +def compute_wire_bytes(df): + """Encode the DataFrame once to a QWP buffer to learn the per-flush wire + size, used for the mib_per_s metric in the JSON contract (plan s3.2). + + This is the bytes pushed per ``Client.dataframe`` flush for this schema; it + is deterministic for a given DataFrame, so one encode suffices. Paths that + talk to a server prefer the bytes the server actually observed (see + ``measured_wire_bytes_per_call``); this is the fallback estimate. + """ + buf = qi.Buffer.qwp() + buf.dataframe(df, table_name="bench_wire_size", at="ts") + return len(buf) + + +def measured_wire_bytes_per_call(last): + """Per-flush wire bytes observed by the mock ACK server, if available. 
+ + The ACK server counts every binary frame byte it received; dividing by the + timed call count gives the real bytes-on-wire per flush, which is more + honest than the one-shot buffer estimate (it includes WS framing and the + warm symbol-dict). Returns ``None`` when no ACK snapshot is present. + """ + if not isinstance(last, dict): + return None + ack = last.get("ack_server") + timed = last.get("timed_calls") + if not ack or not timed: + return None + total = ack.get("binary_bytes") + if not total: + return None + return total / timed + + +def _machine_block(): + return { + "python": sys.version, + "platform": platform.platform(), + "processor": platform.processor(), + "pandas": pd.__version__, + "numpy": np.__version__, + "pyarrow": pa.__version__ if pa is not None else None, + } + + +def _commits_block(): + return { + "py_questdb_client": git_rev(os.getcwd()), + "c_questdb_client": git_rev( + os.path.join(os.getcwd(), "c-questdb-client")), + } + + +def _path_summary(samples, cpu_samples, rows, columns, *, phase, warm, + wire_bytes, last=None): + """Build a contract-conformant per-path summary block (plan s3.2).""" + rate_wire_bytes = wire_bytes if phase == "e2e" else None + summary = summarize(samples) + add_rates(summary, rows, columns, rate_wire_bytes) + add_cpu_summary(summary, cpu_samples, rows, columns, rate_wire_bytes) + summary["phase"] = phase + summary["warm"] = warm + summary["wire_bytes"] = wire_bytes + if last is not None: + summary["last"] = last + return summary + + +def pandas_to_questdb_throughput( + *, + rows, + iterations, + warmups, + sym_card=DEFAULT_SYM_CARD, + varchar_len=DEFAULT_VARCHAR_LEN, + run_mode="full", + real_conf=None, + real_http=None, + real_table=None, + real_setup_sql=(), + real_reset_sql=(), + max_rows_per_chunk=None, + schema="s1-narrow"): + """WS-7 headline deliverable (plan s4): one call that yields S1 ingress + rows/s + MiB/s for the no-network floor *and* the end-to-end path, plus the + cold first-flush vs warm 
steady-state split and the honest + populate_plus_encode sum (plan s3.6). + + * ``columnar-populate`` is the populate floor (descriptor building only). + * the cold/warm split runs against the in-process mock ACK server so the + encode + flush cost is measured without needing a server; the warm median + is the honest ``populate_plus_encode`` headline. + * when ``real_conf`` is given, ``real-client`` adds the true end-to-end + number against a live QuestDB and the DEDUP ``count() == rows`` gate. + + Ack level is ``Ok`` (the mock server and the default qwpws conf). ``Durable`` + is Enterprise (``request_durable_ack=on``) and is deferred (plan s13). + """ + df = build_schema_df( + schema, rows, sym_card=sym_card, varchar_len=varchar_len) + columns = len(df.columns) + try: + wire_bytes = compute_wire_bytes(df) + except Exception: + wire_bytes = None + + paths = {} + + # Floor: populate only (no encode, no wire). + populate_samples, populate_cpu, populate_last = run_columnar_populate( + df, rows, iterations, warmups, max_rows_per_chunk) + paths["columnar-populate"] = _path_summary( + populate_samples, populate_cpu, rows, columns, + phase="floor", warm=warmups > 0, wire_bytes=wire_bytes, + last=populate_last) + + # Cold/warm split over the in-process mock server (server-free e2e). 
+ cold_s, cold_cpu, warm_samples, warm_cpu, split_last = run_cold_warm_split( + df, rows, max(iterations, 1)) + measured = measured_wire_bytes_per_call(split_last) + e2e_wire_bytes = measured if measured is not None else wire_bytes + cold_summary = _path_summary( + [cold_s], [cold_cpu], rows, columns, + phase="e2e", warm=False, wire_bytes=e2e_wire_bytes) + warm_summary = _path_summary( + warm_samples, warm_cpu, rows, columns, + phase="e2e", warm=True, wire_bytes=e2e_wire_bytes, last=split_last) + paths["mock-cold-first-flush"] = cold_summary + paths["mock-warm-steady-state"] = warm_summary + + # Optional true end-to-end against a live QuestDB (DEDUP count()==rows gate + # is enforced by the layer3 fixture; here we just record the rate). + if real_conf: + table_name = real_table or _bench_table_name(schema) + e2e_samples, e2e_cpu, e2e_last = run_real_client_path( + df, rows, iterations, warmups, + conf=real_conf, table_name=table_name, http_base=real_http, + setup_sqls=real_setup_sql, reset_sqls=real_reset_sql) + real_measured = measured_wire_bytes_per_call(e2e_last) + paths["real-client"] = _path_summary( + e2e_samples, e2e_cpu, rows, columns, + phase="e2e", warm=warmups > 0, + wire_bytes=real_measured if real_measured is not None + else wire_bytes, + last=e2e_last) + + # Honest sum (plan s3.6): the warm e2e flush already includes populate + + # encode + flush, so it *is* populate_plus_encode. We surface the populate + # floor and the marginal encode+io cost alongside, never headlining the + # near-free descriptor append on its own. 
+ populate_s = paths["columnar-populate"]["median_s"] + warm_e2e_s = warm_summary["median_s"] + encode_plus_io_s = max(warm_e2e_s - populate_s, 0.0) + headline = { + "populate_floor_s": populate_s, + "encode_plus_io_s": encode_plus_io_s, + "populate_plus_encode_s": warm_e2e_s, + "populate_plus_encode_rows_per_s": ( + rows / warm_e2e_s if warm_e2e_s else None), + "populate_plus_encode_mib_per_s": ( + (e2e_wire_bytes / _MIB) / warm_e2e_s + if e2e_wire_bytes and warm_e2e_s else None), + "cold_first_flush_s": cold_summary["median_s"], + "warm_steady_state_s": warm_e2e_s, + "cold_over_warm_ratio": ( + cold_summary["median_s"] / warm_e2e_s if warm_e2e_s else None), + "warm_from_pool": True, + } + if real_conf: + headline["real_client_s"] = paths["real-client"]["median_s"] + headline["real_client_rows_per_s"] = ( + paths["real-client"]["rows_per_s_median"]) + headline["real_client_mib_per_s"] = ( + paths["real-client"]["mib_per_s"]) + + return { + "schema": schema, + "rows": rows, + "columns": columns, + "dtypes": {name: str(dtype) for name, dtype in df.dtypes.items()}, + "direction": "ingress", + "client": "py-pandas", + "run_mode": run_mode, + "warmups": warmups, + "wire_bytes": wire_bytes, + "ack_level": "Ok", + "machine": _machine_block(), + "commits": _commits_block(), + "headline": headline, + "paths": paths, + } + + +def main(): + parser = argparse.ArgumentParser( + description=( + "Layer 1 pandas columnar benchmark: row-buffer serialization " + "versus #148 chunk population, plus Arrow materialization.")) + parser.add_argument( + "--schema", + choices=sorted(SCHEMAS) + ["all"], + default="numeric-core") + parser.add_argument( + "--rows", + type=int, + default=_env_int("QUESTDB_COLUMN_BENCH_ROWS", 100_000), + help="Rows per DataFrame (env: QUESTDB_COLUMN_BENCH_ROWS).") + parser.add_argument("--iterations", type=int, default=20) + parser.add_argument("--warmups", type=int, default=3) + parser.add_argument( + "--sym-card", + type=int, + 
+ default=_env_int("QUESTDB_COLUMN_BENCH_SYM_CARD", DEFAULT_SYM_CARD), + help=( + "SYMBOL cardinality for the s1-narrow and s2-wide schemas " + "(env: QUESTDB_COLUMN_BENCH_SYM_CARD).")) + parser.add_argument( + "--varchar-len", + type=int, + default=_env_int( + "QUESTDB_COLUMN_BENCH_VARCHAR_LEN", DEFAULT_VARCHAR_LEN), + help=( + "VARCHAR byte length for the s1-narrow and s2-wide schemas " + "(env: QUESTDB_COLUMN_BENCH_VARCHAR_LEN).")) + parser.add_argument( + "--run-mode", + choices=["quick", "full"], + default="full", + help=( + "Recorded in the JSON contract (plan s3.2). 'quick' is the CI " + "shape; 'full' is the headline shape.")) + parser.add_argument( + "--max-rows-per-chunk", + type=int, + help="Override the internal columnar row chunk cap.") + parser.add_argument( + "--ack-delay-ms", + type=float, + default=0.0, + help="Delay each local QWP/WebSocket ACK by this many milliseconds.") + parser.add_argument( + "--ack-reuse-min-calls", + type=int, + default=100, + help="Minimum timed calls for the client-ack-reuse path.") + parser.add_argument( + "--ack-reuse-max-seconds", + type=float, + default=10.0, + help="Maximum timed seconds for the client-ack-reuse path.") + parser.add_argument( + "--real-conf", + help="QWP/WebSocket configuration string for real-server runs.") + parser.add_argument( + "--real-http", + help=( + "QuestDB HTTP base URL for setup/reset SQL in real-server runs.")) + parser.add_argument( + "--real-table", + help="Target table name for real-server runs.") + parser.add_argument( + "--real-await-ack-ms", + type=int, + default=30000, + help="ACK timeout for the real-row path.") + parser.add_argument( + "--real-setup-sql", + action="append", + default=[], + help="SQL executed once before real-server warmups.") + parser.add_argument( + "--real-reset-sql", + action="append", + default=[], + help="SQL executed before each real-server warmup/timed iteration.") + parser.add_argument( + "--path", + choices=sorted(PATHS), + action="append", + help="Path to run. 
Defaults to all paths.") + parser.add_argument( + "--pretty", + action="store_true", + help="Pretty-print JSON output.") + parser.add_argument( + "--support-report", + action="store_true", + help=( + "Report Client.dataframe v1 eligibility, chunk planning, and " + "pre-publication rejection details instead of timing paths.")) + parser.add_argument( + "--schema-sql", + action="store_true", + help=( + "Print QuestDB DROP/CREATE/TRUNCATE SQL metadata for selected " + "benchmark schemas and exit.")) + parser.add_argument( + "--headline", + action="store_true", + help=( + "Run the pandas_to_questdb_throughput headline (plan s4): the " + "columnar-populate floor + the cold/warm e2e split (mock server) + " + "the populate_plus_encode sum, on the selected schema (default " + "s1-narrow). Add --real-conf to include the live-server " + "real-client number.")) + args = parser.parse_args() + + if args.headline: + schema = "s1-narrow" if args.schema == "numeric-core" else args.schema + if any(opt and not args.real_http + for opt in (args.real_setup_sql, args.real_reset_sql)): + parser.error("--real-http is required with real setup/reset SQL") + result = pandas_to_questdb_throughput( + rows=args.rows, + iterations=args.iterations, + warmups=args.warmups, + sym_card=args.sym_card, + varchar_len=args.varchar_len, + run_mode=args.run_mode, + real_conf=args.real_conf, + real_http=args.real_http, + real_table=args.real_table, + real_setup_sql=args.real_setup_sql, + real_reset_sql=args.real_reset_sql, + max_rows_per_chunk=args.max_rows_per_chunk, + schema=schema) + print(json.dumps( + result, + indent=2 if args.pretty else None, + sort_keys=True)) + return + + if args.schema_sql: + schema_names = ( + sorted(SCHEMAS) if args.schema == "all" else [args.schema]) + output = { + "schemas": [ + schema_sql_report(schema_name) + for schema_name in schema_names + ], + } + print(json.dumps( + output, + indent=2 if args.pretty else None, + sort_keys=True)) + return + + if args.support_report: + 
+ schema_names = ( + sorted(SCHEMAS) if args.schema == "all" else [args.schema]) + reports = [ + columnar_support_report( + schema_name, + args.rows, + args.max_rows_per_chunk, + sym_card=args.sym_card, + varchar_len=args.varchar_len) + for schema_name in schema_names + ] + output = { + "rows": args.rows, + "reports": reports, + } + print(json.dumps( + output, + indent=2 if args.pretty else None, + sort_keys=True)) + return + + if args.schema == "all": + parser.error("--schema all requires --support-report or --schema-sql") + + paths = args.path or [ + "row", + "columnar-populate", + "arrow-materialize", + "client-ack"] + real_table = args.real_table or _bench_table_name(args.schema) + if any(path.startswith("real-") for path in paths): + if not args.real_conf: + parser.error("real-server paths require --real-conf") + if (args.real_setup_sql or args.real_reset_sql) and not args.real_http: + parser.error("--real-http is required with real setup/reset SQL") + df = build_schema_df( + args.schema, + args.rows, + sym_card=args.sym_card, + varchar_len=args.varchar_len) + + results = { + "schema": args.schema, + "rows": args.rows, + "columns": len(df.columns), + "dtypes": {name: str(dtype) for name, dtype in df.dtypes.items()}, + "direction": "ingress", + "client": "py-pandas", + "run_mode": args.run_mode, + "warmups": args.warmups, + "machine": { + "python": sys.version, + "platform": platform.platform(), + "processor": platform.processor(), + "pandas": pd.__version__, + "numpy": np.__version__, + "pyarrow": pa.__version__ if pa is not None else None, + }, + "commits": { + "py_questdb_client": git_rev(os.getcwd()), + "c_questdb_client": git_rev( + os.path.join(os.getcwd(), "c-questdb-client")), + }, + "paths": {}, + } + + # Per-flush wire size for this schema (used by mib_per_s in the contract). + # Only DataFrames the columnar/row path can encode have a meaningful size; + # rejection schemas are skipped (they never reach the wire). 
+ try: + wire_bytes = compute_wire_bytes(df) + except Exception: + wire_bytes = None + results["wire_bytes"] = wire_bytes + + for path in paths: + if path == "columnar-populate": + samples, cpu_samples, last = run_columnar_populate( + df, + args.rows, + args.iterations, + args.warmups, + args.max_rows_per_chunk) + elif path == "client-ack": + samples, cpu_samples, last = run_client_ack( + df, + args.rows, + args.iterations, + args.warmups, + ack_delay_s=args.ack_delay_ms / 1000.0) + elif path == "client-ack-reuse": + samples, cpu_samples, last = run_client_ack( + df, + args.rows, + max(args.iterations, args.ack_reuse_min_calls), + args.warmups, + min_calls=args.ack_reuse_min_calls, + max_seconds=args.ack_reuse_max_seconds, + ack_delay_s=args.ack_delay_ms / 1000.0) + elif path == "real-row": + samples, cpu_samples, last = run_real_row_path( + df, + args.rows, + args.iterations, + args.warmups, + conf=args.real_conf, + table_name=real_table, + http_base=args.real_http, + setup_sqls=args.real_setup_sql, + reset_sqls=args.real_reset_sql, + await_ack_ms=args.real_await_ack_ms) + elif path == "real-client": + samples, cpu_samples, last = run_real_client_path( + df, + args.rows, + args.iterations, + args.warmups, + conf=args.real_conf, + table_name=real_table, + http_base=args.real_http, + setup_sqls=args.real_setup_sql, + reset_sqls=args.real_reset_sql) + else: + samples, cpu_samples, last = PATHS[path]( + df, + args.rows, + args.iterations, + args.warmups) + phase = PATH_PHASE.get(path, "e2e") + # Prefer the bytes the mock server actually observed; fall back to the + # one-shot encode estimate. mib_per_s is wire throughput, so it only + # applies to e2e paths; floor paths record wire_bytes for reference but + # report no rate. 
+ measured = measured_wire_bytes_per_call(last) + path_wire_bytes_report = measured if measured is not None else wire_bytes + rate_wire_bytes = path_wire_bytes_report if phase == "e2e" else None + summary = summarize(samples) + add_rates(summary, args.rows, len(df.columns), rate_wire_bytes) + add_cpu_summary( + summary, cpu_samples, args.rows, len(df.columns), rate_wire_bytes) + summary["phase"] = phase + summary["wire_bytes"] = path_wire_bytes_report + # The timed samples are warm steady-state whenever warmups ran; the + # cold first-flush is reported separately by the cold/warm split. + summary["warm"] = args.warmups > 0 + summary["last"] = last + results["paths"][path] = summary + + print(json.dumps(results, indent=2 if args.pretty else None, sort_keys=True)) + + +if __name__ == "__main__": + main() diff --git a/test/benchmark_pandas_egress.py b/test/benchmark_pandas_egress.py new file mode 100644 index 00000000..d79f4ebc --- /dev/null +++ b/test/benchmark_pandas_egress.py @@ -0,0 +1,429 @@ +#!/usr/bin/env python3 +"""Step 2 pandas egress benchmark (QWP_DATAFRAME_BENCH_PLAN.md s5). + +Mirror of the ingress harness (``benchmark_pandas_columnar.py``): reads the +s1-narrow table back from a real QuestDB over QWP/WebSocket and measures the +decode -> DataFrame paths, emitting the identical JSON metric contract with +``direction="egress"``. + +Paths (plan s5.3): + +* ``decode-only`` -- iterate the cursor's Arrow batches without building a + DataFrame (the egress floor; analog of + ``columnar-populate``). +* ``to-pandas`` -- default numpy materialise (the headline). +* ``to-polars`` -- Polars output (shares the Arrow path). +* ``arrow-c-stream`` -- ``__arrow_c_stream__`` -> ``polars.from_arrow`` (no + pyarrow on the consumer side). +* ``iter-pandas`` -- lazy per-batch materialise vs ``to-pandas`` full. + +The headline run pairs ``decode-only`` (floor) + ``to-pandas`` (e2e) and reports +the honest ``decode_plus_assemble`` sum (plan s3.6). 
+ +Because egress decodes a *server* result, every path needs a populated table; +there is no server-free floor here (the server-free RESULT_BATCH replay server +is deferred to plan Step 5). Point this at a table the ingress side already +filled, or use ``run_pandas_egress_layer3.py`` which ingests then reads back in +one shot. +""" + +import argparse +import gc +import json +import os +import platform +import sys +import time + +sys.dont_write_bytecode = True + +import numpy as np +import pandas as pd + +try: + import pyarrow as pa +except ImportError: + pa = None + +import patch_path +import questdb.ingress as qi + +# Reuse the ingress spine: schema generator, the JSON-contract helpers, SQL +# helpers, timing, and the table-name convention all stay shared so the two +# directions emit the same shape and the parity aggregator sees one schema. +from benchmark_pandas_columnar import ( + DEFAULT_SYM_CARD, + DEFAULT_VARCHAR_LEN, + _bench_table_name, + _commits_block, + _env_int, + _machine_block, + _path_summary, + build_schema_df, + execute_sql, + summarize, + timed_call, +) + + +# Egress path phases (plan s3.2): decode-only is the no-assemble floor; the +# materialise paths are the end-to-end decode+assemble. +PATH_PHASE = { + "decode-only": "floor", + "to-pandas": "e2e", + "to-pandas-arrow": "e2e", + "to-pandas-nullable": "e2e", + "to-polars": "e2e", + "arrow-c-stream": "e2e", + "iter-pandas": "e2e", +} + +ALL_PATHS = list(PATH_PHASE) + + +def _require_pyarrow(): + if pa is None: + raise RuntimeError("pyarrow is not installed") + + +def _drain_arrow(result): + """Floor: pull every Arrow RecordBatch and touch it, but build no + DataFrame. 
This is the decode cost with zero assembly.""" + rows = 0 + cols = 0 + for batch in result.iter_arrow(): + rows += batch.num_rows + cols = batch.num_columns + return {"rows": rows, "columns": cols} + + +def _to_pandas(result): + df = result.to_pandas() + return {"rows": len(df), "columns": len(df.columns)} + + +def _to_pandas_arrow(result): + # Arrow-backed pandas: dtype_backend="pyarrow" routes through to_arrow() + # and maps every Arrow type to pandas ArrowDtype, so VARCHAR stays an + # Arrow string array (three O(1) buffers shared from the decode) instead + # of an object column of N per-row Python str objects. This is the + # cardinality-independent VARCHAR egress win (vs the numpy default). + df = result.to_pandas(dtype_backend="pyarrow") + return {"rows": len(df), "columns": len(df.columns)} + + +def _to_pandas_nullable(result): + # numpy-nullable backend: Arrow -> pandas masked extension dtypes + # (Int64/Float64/boolean). VARCHAR maps to StringDtype (string[python]), + # so it still materialises N Python str objects like the numpy default; + # this arm isolates the masked-numeric overhead from the Arrow-string win. + df = result.to_pandas(dtype_backend="numpy_nullable") + return {"rows": len(df), "columns": len(df.columns)} + + +def _to_polars(result): + df = result.to_polars() + return {"rows": df.height, "columns": df.width} + + +def _arrow_c_stream(result, pl): + # Consume the native __arrow_c_stream__ capsule with polars (no pyarrow on + # the consumer side). polars.from_arrow accepts any object exposing the + # Arrow C stream protocol. 
+ df = pl.from_arrow(result) + return {"rows": df.height, "columns": df.width} + + +def _iter_pandas(result): + rows = 0 + cols = 0 + for df in result.iter_pandas(): + rows += len(df) + cols = len(df.columns) + return {"rows": rows, "columns": cols} + + +def _make_runner(path, pl): + if path == "decode-only": + _require_pyarrow() + return _drain_arrow + if path == "to-pandas": + _require_pyarrow() + return _to_pandas + if path == "to-pandas-arrow": + _require_pyarrow() + return _to_pandas_arrow + if path == "to-pandas-nullable": + _require_pyarrow() + return _to_pandas_nullable + if path == "to-polars": + return _to_polars + if path == "arrow-c-stream": + if pl is None: + raise RuntimeError("polars is required for arrow-c-stream") + return lambda result: _arrow_c_stream(result, pl) + if path == "iter-pandas": + _require_pyarrow() + return _iter_pandas + raise ValueError(f"unknown egress path: {path}") + + +def run_egress_path( + path, + *, + client, + sql, + rows, + iterations, + warmups, + pl=None): + """Time one egress path. Each iteration issues a fresh query (QueryResult + is single-use) and materialises it via ``path``; the timed region covers + the query round-trip + decode (+ assemble for the e2e paths).""" + runner = _make_runner(path, pl) + + def once(): + with client.query(sql) as result: + out = runner(result) + if out["rows"] != rows: + raise AssertionError( + f"{path}: read back {out['rows']} rows, expected {rows}") + return out + + for _ in range(warmups): + once() + + samples = [] + cpu_samples = [] + last = None + for _ in range(iterations): + elapsed, cpu_elapsed, last = timed_call(once) + samples.append(elapsed) + cpu_samples.append(cpu_elapsed) + return samples, cpu_samples, last + + +def measure_egress_wire_bytes(client, sql): + """Per-query wire payload size for the mib_per_s metric (plan s3.2). 
+ + Uses the materialised Arrow table's nbytes as the on-wire payload proxy: + it is the decoded column-buffer size the server streamed, deterministic for + a given table, and the natural egress analog of the ingress wire_bytes. + """ + _require_pyarrow() + with client.query(sql) as result: + table = result.to_arrow() + return int(table.nbytes) + + +def verify_zero_copy(client, sql): + """Characterisation deliverable (plan s5.4): confirm the fixed-width fast + path is zero-copy on the Arrow side. + + The numpy ``to_pandas`` path bulk-copies each fixed column out of the + transient wire buffer (``_numpy_fixed_chunk`` -> ``np.frombuffer(...).copy()``) + because the buffer is recycled. The genuine zero-copy surface is the Arrow + batch (``iter_arrow`` / ``__arrow_c_stream__``): its column buffers are the + decoded buffers exposed through the Arrow C Data Interface. We assert that a + numpy view built from a fixed-width Arrow column buffer shares memory with a + numpy array sliced from the same pyarrow column (no copy in between). + """ + _require_pyarrow() + report = {"checked_columns": [], "zero_copy": None} + with client.query(sql) as result: + reader = result.iter_arrow() + try: + batch = next(reader) + except StopIteration: + report["zero_copy"] = False + report["note"] = "no batches returned" + return report + # Fixed-width numeric columns (id LONG, price DOUBLE) decode to + # contiguous Arrow buffers we can view zero-copy. + ok_any = False + for name in ("id", "price"): + if name not in batch.schema.names: + continue + col = batch.column(batch.schema.names.index(name)) + # pyarrow zero-copy to numpy for a no-null primitive column. + arr = col.to_numpy(zero_copy_only=True) + # The Arrow buffer underlying the column; build a second numpy view + # straight off its address and assert it aliases the same memory. 
+ buffers = col.buffers() + data_buf = buffers[-1] + view = np.frombuffer(data_buf, dtype=arr.dtype, count=len(arr)) + shares = bool(np.shares_memory(arr, view)) + report["checked_columns"].append({ + "column": name, + "zero_copy_to_numpy": True, + "shares_memory_with_buffer": shares, + }) + ok_any = ok_any or shares + # Drain the rest so the cursor releases cleanly. + for _ in reader: + pass + report["zero_copy"] = ok_any + return report + + +def build_egress_report( + *, + client, + table_name, + rows, + columns, + iterations, + warmups, + run_mode, + paths, + wire_bytes, + schema="s1-narrow", + zero_copy=None, + extra=None): + sql = f"SELECT * FROM {table_name}" + try: + import polars as pl + except ImportError: + pl = None + + path_results = {} + for path in paths: + samples, cpu_samples, last = run_egress_path( + path, + client=client, + sql=sql, + rows=rows, + iterations=iterations, + warmups=warmups, + pl=pl) + phase = PATH_PHASE.get(path, "e2e") + path_results[path] = _path_summary( + samples, cpu_samples, rows, columns, + phase=phase, warm=warmups > 0, wire_bytes=wire_bytes, last=last) + + # Honest sum (plan s3.6): decode_plus_assemble = the to-pandas e2e (it + # already includes decode + assemble); decode-only is the floor; the + # marginal assemble is the difference. Headline the sum, never the floor. 
+ headline = {} + if "decode-only" in path_results and "to-pandas" in path_results: + decode_s = path_results["decode-only"]["median_s"] + assemble_e2e_s = path_results["to-pandas"]["median_s"] + headline = { + "decode_floor_s": decode_s, + "assemble_plus_io_s": max(assemble_e2e_s - decode_s, 0.0), + "decode_plus_assemble_s": assemble_e2e_s, + "decode_plus_assemble_rows_per_s": ( + rows / assemble_e2e_s if assemble_e2e_s else None), + "decode_plus_assemble_mib_per_s": ( + (wire_bytes / (1024.0 * 1024.0)) / assemble_e2e_s + if wire_bytes and assemble_e2e_s else None), + } + for alt in ("to-pandas-arrow", "to-pandas-nullable", + "to-polars", "arrow-c-stream"): + if alt in path_results: + headline[f"{alt}_rows_per_s"] = ( + path_results[alt]["rows_per_s_median"]) + headline[f"{alt}_mib_per_s"] = path_results[alt]["mib_per_s"] + + report = { + "schema": schema, + "rows": rows, + "columns": columns, + "direction": "egress", + "client": "py-pandas", + "run_mode": run_mode, + "warmups": warmups, + "wire_bytes": wire_bytes, + "machine": _machine_block(), + "commits": _commits_block(), + "headline": headline, + "paths": path_results, + } + if zero_copy is not None: + report["zero_copy_check"] = zero_copy + if extra: + report.update(extra) + return report + + +def fetch_row_count(http_base, table_name): + result = execute_sql(http_base, f"SELECT count() FROM {table_name}") + parsed = json.loads(result["body"]) + return parsed["dataset"][0][0] + + +def main(): + parser = argparse.ArgumentParser( + description=( + "pandas egress benchmark: read the s1-narrow table back from a " + "real QuestDB and measure decode -> DataFrame paths.")) + parser.add_argument( + "--rows", + type=int, + default=_env_int("QUESTDB_COLUMN_BENCH_ROWS", 100_000), + help="Rows expected in the table (env: QUESTDB_COLUMN_BENCH_ROWS).") + parser.add_argument("--iterations", type=int, default=10) + parser.add_argument("--warmups", type=int, default=2) + parser.add_argument( + "--run-mode", choices=["quick", 
"full"], default="full") + parser.add_argument( + "--real-conf", + required=True, + help="QWP/WebSocket configuration string for the real server.") + parser.add_argument( + "--real-http", + help="QuestDB HTTP base URL (for the count() sanity check).") + parser.add_argument( + "--real-table", + help="Table to read back (defaults to the schema's bench table).") + parser.add_argument( + "--schema", default="s1-narrow", + help="Schema name recorded in the report; also picks the default " + "table name (e.g. s1-narrow, s2-wide).") + parser.add_argument( + "--columns", type=int, default=5, + help="Column count recorded in the report (s1-narrow=5, s2-wide=15).") + parser.add_argument( + "--path", + choices=ALL_PATHS, + action="append", + help="Egress path(s) to run. Defaults to all paths.") + parser.add_argument( + "--zero-copy-check", + action="store_true", + help="Assert the fixed-width fast path is zero-copy on the Arrow side.") + parser.add_argument("--pretty", action="store_true") + args = parser.parse_args() + + table_name = args.real_table or _bench_table_name(args.schema) + paths = args.path or ALL_PATHS + + with qi.Client.from_conf(args.real_conf) as client: + if args.real_http is not None: + actual = fetch_row_count(args.real_http, table_name) + if actual != args.rows: + raise AssertionError( + f"table {table_name} has {actual} rows, expected " + f"{args.rows}; ingest the s1-narrow table first") + sql = f"SELECT * FROM {table_name}" + wire_bytes = measure_egress_wire_bytes(client, sql) + zero_copy = verify_zero_copy(client, sql) if args.zero_copy_check \ + else None + report = build_egress_report( + client=client, + table_name=table_name, + rows=args.rows, + columns=args.columns, + schema=args.schema, + iterations=args.iterations, + warmups=args.warmups, + run_mode=args.run_mode, + paths=paths, + wire_bytes=wire_bytes, + zero_copy=zero_copy) + + print(json.dumps(report, indent=2 if args.pretty else None, sort_keys=True)) + + +if __name__ == "__main__": + main() 
diff --git a/test/qwp_ws_ack_server.py b/test/qwp_ws_ack_server.py new file mode 100644 index 00000000..21d2ad91 --- /dev/null +++ b/test/qwp_ws_ack_server.py @@ -0,0 +1,245 @@ +import base64 +import hashlib +import socket +import struct +import threading +import time + + +WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11" +QWP_STATUS_OK = 0x00 + + +class QwpAckServer: + def __init__(self, *, host="127.0.0.1", ack_delay_s=0.0): + self.host = host + self.ack_delay_s = ack_delay_s + self.port = None + self._sock = None + self._stop = threading.Event() + self._thread = None + self._handlers = [] + self._lock = threading.Lock() + self.accept_count = 0 + self.binary_frame_count = 0 + self.qwp1_frame_count = 0 + self.binary_bytes = 0 + self.binary_prefixes = [] + self.control_frame_count = 0 + self.errors = [] + + def __enter__(self): + self.start() + return self + + def __exit__(self, exc_type, exc, tb): + self.stop() + + def start(self): + self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) + self._sock.bind((self.host, 0)) + self._sock.listen() + self._sock.settimeout(0.2) + self.port = self._sock.getsockname()[1] + self._thread = threading.Thread(target=self._accept_loop, daemon=True) + self._thread.start() + + def stop(self): + self._stop.set() + if self.port is not None: + try: + with socket.create_connection((self.host, self.port), timeout=0.2): + pass + except OSError: + pass + if self._thread is not None: + self._thread.join(timeout=2) + for handler in list(self._handlers): + handler.join(timeout=2) + if self._sock is not None: + try: + self._sock.close() + except OSError: + pass + + def snapshot(self): + with self._lock: + return { + "accepted_connections": self.accept_count, + "binary_frames": self.binary_frame_count, + "qwp1_frames": self.qwp1_frame_count, + "binary_bytes": self.binary_bytes, + "binary_prefixes": list(self.binary_prefixes), + "control_frames": 
self.control_frame_count, + "errors": list(self.errors), + } + + def _accept_loop(self): + while not self._stop.is_set(): + try: + conn, _addr = self._sock.accept() + except socket.timeout: + continue + except OSError: + break + if self._stop.is_set(): + conn.close() + break + with self._lock: + self.accept_count += 1 + handler = threading.Thread( + target=self._handle_connection, + args=(conn,), + daemon=True) + self._handlers.append(handler) + handler.start() + + def _handle_connection(self, conn): + next_seq = 0 + try: + conn.settimeout(5) + request = _read_until(conn, b"\r\n\r\n") + key = _header(request, "Sec-WebSocket-Key") + accept = _compute_accept(key) + response = ( + "HTTP/1.1 101 Switching Protocols\r\n" + "Upgrade: websocket\r\n" + "Connection: Upgrade\r\n" + f"Sec-WebSocket-Accept: {accept}\r\n" + "X-QWP-Version: 1\r\n" + "\r\n") + conn.sendall(response.encode("ascii")) + + while not self._stop.is_set(): + frame = _read_frame(conn) + if frame is None: + break + _fin, opcode, payload = frame + if opcode == 0x8: + with self._lock: + self.control_frame_count += 1 + _write_frame(conn, 0x8, b"") + break + if opcode == 0x9: + with self._lock: + self.control_frame_count += 1 + _write_frame(conn, 0xA, payload) + continue + if opcode != 0x2: + with self._lock: + self.control_frame_count += 1 + continue + + with self._lock: + self.binary_frame_count += 1 + if payload.startswith(b"QWP1"): + self.qwp1_frame_count += 1 + self.binary_bytes += len(payload) + if len(self.binary_prefixes) < 16: + self.binary_prefixes.append(payload[:8].hex()) + if self.ack_delay_s: + time.sleep(self.ack_delay_s) + _write_qwp_ok(conn, next_seq) + next_seq += 1 + except Exception as exc: + with self._lock: + self.errors.append(repr(exc)) + finally: + try: + conn.close() + except OSError: + pass + + +def _read_exact(conn, length): + chunks = [] + remaining = length + while remaining: + chunk = conn.recv(remaining) + if not chunk: + return None + chunks.append(chunk) + remaining -= 
len(chunk) + return b"".join(chunks) + + +def _read_until(conn, marker): + data = bytearray() + while marker not in data: + chunk = conn.recv(256) + if not chunk: + raise ConnectionError("connection closed during HTTP upgrade") + data.extend(chunk) + return bytes(data) + + +def _header(request, name): + text = request.decode("iso-8859-1") + prefix = name.lower() + ":" + for line in text.split("\r\n"): + if line.lower().startswith(prefix): + return line.split(":", 1)[1].strip() + raise ValueError(f"missing HTTP header {name}") + + +def _compute_accept(key): + digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest() + return base64.b64encode(digest).decode("ascii") + + +def _read_frame(conn): + header = _read_exact(conn, 2) + if header is None: + return None + fin = bool(header[0] & 0x80) + opcode = header[0] & 0x0F + masked = bool(header[1] & 0x80) + short_len = header[1] & 0x7F + if short_len == 126: + ext = _read_exact(conn, 2) + if ext is None: + return None + payload_len = struct.unpack("!H", ext)[0] + elif short_len == 127: + ext = _read_exact(conn, 8) + if ext is None: + return None + payload_len = struct.unpack("!Q", ext)[0] + else: + payload_len = short_len + + mask = b"" + if masked: + mask = _read_exact(conn, 4) + if mask is None: + return None + payload = _read_exact(conn, payload_len) + if payload is None: + return None + if masked: + payload = bytes(byte ^ mask[index & 3] + for index, byte in enumerate(payload)) + return fin, opcode, payload + + +def _write_frame(conn, opcode, payload): + frame = bytearray([0x80 | (opcode & 0x0F)]) + payload_len = len(payload) + if payload_len <= 125: + frame.append(payload_len) + elif payload_len <= 0xFFFF: + frame.append(126) + frame.extend(struct.pack("!H", payload_len)) + else: + frame.append(127) + frame.extend(struct.pack("!Q", payload_len)) + frame.extend(payload) + conn.sendall(frame) + + +def _write_qwp_ok(conn, wire_seq): + payload = bytearray([QWP_STATUS_OK]) + payload.extend(struct.pack(" 0 + 
summary["last"] = last + paths[path_name] = summary + # DEDUP correctness gate: at-least-once replays must not inflate + # the row count past the rows we sent (plan s3.4). + if not row_count_check["ok"]: + raise AssertionError( + f"{path_name}: DEDUP row-count mismatch on " + f"{schema_sql['table_name']}: expected " + f"{row_count_check['expected']}, got " + f"{row_count_check['actual']} (at-least-once inflation or " + "missing DEDUP UPSERT KEYS(ts))") + + return { + "schema": args.schema, + "rows": args.rows, + "columns": len(df.columns), + "direction": "ingress", + "client": "py-pandas", + "run_mode": args.run_mode, + "iterations": args.iterations, + "warmups": args.warmups, + "wire_bytes": wire_bytes, + "questdb_version": ".".join(str(part) for part in version), + "questdb_repo": str(pathlib.Path(args.questdb_repo).resolve()), + "http_base": http_base, + "real_conf": conf, + "schema_sql": schema_sql, + "settings": settings, + "paths": paths, + } + finally: + qdb.stop() + + +def fetch_http_endpoint(http_base, path): + url = http_base.rstrip("/") + path + try: + with urllib.request.urlopen(url, timeout=30) as response: + body = response.read().decode("utf-8", errors="replace") + try: + parsed = json.loads(body) + except json.JSONDecodeError: + parsed = body + return { + "url": url, + "status": response.status, + "body": parsed, + } + except urllib.error.HTTPError as error: + return { + "url": url, + "status": error.code, + "body": error.read().decode("utf-8", errors="replace"), + } + + +def _count(http_base, table_name): + result = execute_sql(http_base, f"SELECT count() FROM {table_name}") + parsed = json.loads(result["body"]) + return parsed["dataset"][0][0], result["status"] + + +def fetch_row_count(http_base, table_name, *, expected, timeout_sec=120): + """WAL-aware DEDUP count check. + + QuestDB WAL tables apply asynchronously, so an immediate count() right after + ingest can read 0 (apply lag). 
Poll until count() reaches ``expected`` (or + exceeds it, which signals at-least-once DEDUP inflation), or the timeout + elapses (which signals data loss). This makes the count() == rows gate + (plan s3.4) robust instead of racing the WAL. + """ + deadline = time.monotonic() + timeout_sec + actual, status = _count(http_base, table_name) + while actual < expected and time.monotonic() < deadline: + time.sleep(0.5) + actual, status = _count(http_base, table_name) + return { + "actual": actual, + "expected": expected, + "ok": actual == expected, + "inflated": actual > expected, + "status": status, + } + + +def main(): + parser = argparse.ArgumentParser( + description=( + "Start a local QuestDB fixture and run pandas columnar Layer 3 " + "real-row / real-client benchmarks.")) + parser.add_argument( + "--questdb-repo", + default="../questdb", + help=( + "Path to a built QuestDB repo containing " + "core/target/questdb-*-SNAPSHOT.jar.")) + parser.add_argument( + "--schema", + choices=sorted(SUPPORTED_SCHEMAS), + default="s1-narrow") + parser.add_argument("--rows", type=int, default=10000) + parser.add_argument("--iterations", type=int, default=3) + parser.add_argument("--warmups", type=int, default=1) + parser.add_argument( + "--sym-card", + type=int, + default=DEFAULT_SYM_CARD, + help="Low-cardinality SYMBOL `sym` (s1-narrow / s2-wide).") + parser.add_argument( + "--varchar-len", + type=int, + default=DEFAULT_VARCHAR_LEN, + help="VARCHAR byte length (s1-narrow / s2-wide).") + parser.add_argument( + "--varchar-charset", choices=["ascii", "unicode"], default="ascii", + help="VARCHAR note charset (unicode defeats the numpy ASCII fast path).") + parser.add_argument( + "--hi-sym-card", type=int, default=DEFAULT_HI_SYM_CARD, + help="s2-wide high-cardinality SYMBOLs s1..s5 (default 100k anchor).") + parser.add_argument( + "--path", choices=["real-row", "real-client"], action="append", + help="Ingest path(s). 
Default both; use --path real-client for large " + "row counts (real-row is a single flush, capped at 16 MiB).") + parser.add_argument( + "--run-mode", + choices=["quick", "full"], + default="full", + help="Recorded in the JSON contract (plan s3.2).") + parser.add_argument("--pretty", action="store_true") + args = parser.parse_args() + + result = run_layer3(args) + print(json.dumps( + result, + indent=2 if args.pretty else None, + sort_keys=True)) + + +if __name__ == "__main__": + main() diff --git a/test/run_pandas_egress_layer3.py b/test/run_pandas_egress_layer3.py new file mode 100644 index 00000000..00d07156 --- /dev/null +++ b/test/run_pandas_egress_layer3.py @@ -0,0 +1,172 @@ +#!/usr/bin/env python3 +"""Step 2 egress real-server fixture (QWP_DATAFRAME_BENCH_PLAN.md s5.2/s5.4). + +Starts a local QuestDB, ingests the s1-narrow table (DEDUP UPSERT KEYS(ts), +monotonic-unique microsecond ts), waits for the WAL to apply and asserts +count() == rows, then reads the table back through the egress paths +(``benchmark_pandas_egress``) and emits the contract-conformant JSON. + +This reuses the Step 1 DEDUP spine end to end: write in Step 1, read in Step 2, +on the same server. No git-mutation of any QuestDB repo -- the fixture only +copies the prebuilt jar. 
+""" + +import argparse +import contextlib +import json +import pathlib +import sys + +sys.dont_write_bytecode = True + +import patch_path + +PROJ_ROOT = patch_path.PROJ_ROOT +sys.path.append(str(PROJ_ROOT / "c-questdb-client" / "system_test")) + +from fixture import QuestDbFixture, install_questdb_from_repo + +import questdb.ingress as qi +from benchmark_pandas_columnar import ( + DEFAULT_HI_SYM_CARD, + DEFAULT_SYM_CARD, + DEFAULT_VARCHAR_LEN, + build_schema_df, + run_real_client_path, + schema_sql_report, +) +from run_pandas_columnar_layer3 import fetch_http_endpoint, fetch_row_count +from benchmark_pandas_egress import ( + ALL_PATHS, + build_egress_report, + measure_egress_wire_bytes, + verify_zero_copy, +) + + +def run_layer3(args): + with contextlib.redirect_stdout(sys.stderr): + questdb_root = install_questdb_from_repo(pathlib.Path(args.questdb_repo)) + qdb = QuestDbFixture(questdb_root, auth=False, http=True, qwp_udp=False) + with contextlib.redirect_stdout(sys.stderr): + qdb.start() + try: + http_base = f"http://{qdb.host}:{qdb.http_server_port}" + conf = ( + f"qwpws::addr={qdb.host}:{qdb.http_server_port};" + "pool_size=1;pool_max=1;pool_reap=manual;") + schema = args.schema + df = build_schema_df( + schema, args.rows, + sym_card=args.sym_card, varchar_len=args.varchar_len, + varchar_charset=args.varchar_charset, + hi_sym_card=args.hi_sym_card) + sql = schema_sql_report(schema) + table_name = sql["table_name"] + setup_sqls = [sql["drop_sql"], sql["create_sql"]] + + # --- Ingest the S1 table (chunked real-client path; DEDUP-correct). --- + # No reset between iterations: we ingest once (warmups=0, iterations=1) + # so the table holds exactly `rows` to read back. + run_real_client_path( + df, args.rows, 1, 0, + conf=conf, table_name=table_name, http_base=http_base, + setup_sqls=setup_sqls, reset_sqls=()) + + # --- DEDUP gate: WAL-aware count() == rows (plan s3.4). 
--- + count_check = fetch_row_count( + http_base, table_name, expected=args.rows) + if not count_check["ok"]: + raise AssertionError( + f"egress fixture: count() mismatch on {table_name}: " + f"expected {count_check['expected']}, got " + f"{count_check['actual']} " + f"(inflated={count_check.get('inflated')})") + + # --- Read it back through the egress paths. --- + paths = args.path or ALL_PATHS + with qi.Client.from_conf(conf) as client: + read_sql = f"SELECT * FROM {table_name}" + wire_bytes = measure_egress_wire_bytes(client, read_sql) + zero_copy = verify_zero_copy(client, read_sql) + report = build_egress_report( + client=client, + table_name=table_name, + rows=args.rows, + columns=len(df.columns), + schema=schema, + iterations=args.iterations, + warmups=args.warmups, + run_mode=args.run_mode, + paths=paths, + wire_bytes=wire_bytes, + zero_copy=zero_copy, + extra={ + "questdb_version": ".".join( + str(part) for part in qdb.version), + "questdb_repo": str( + pathlib.Path(args.questdb_repo).resolve()), + "http_base": http_base, + "real_conf": conf, + "knobs": { + "sym_card": args.sym_card, + "varchar_len": args.varchar_len, + "varchar_charset": args.varchar_charset, + "hi_sym_card": args.hi_sym_card, + }, + "schema_sql": sql, + "row_count_check": count_check, + "settings": fetch_http_endpoint(http_base, "/settings"), + }) + return report + finally: + qdb.stop() + + +def main(): + parser = argparse.ArgumentParser( + description=( + "Start a local QuestDB, ingest the s1-narrow table, then run the " + "pandas egress read-back benchmark against it.")) + parser.add_argument( + "--questdb-repo", + default="../questdb", + help=( + "Path to a built QuestDB repo containing " + "core/target/questdb-*-SNAPSHOT.jar.")) + parser.add_argument( + "--schema", choices=["s1-narrow", "s2-wide"], default="s1-narrow", + help="DEDUP-correct schema to ingest then read back.") + parser.add_argument("--rows", type=int, default=100_000) + parser.add_argument("--iterations", type=int, 
default=10) + parser.add_argument("--warmups", type=int, default=2) + parser.add_argument( + "--sym-card", type=int, default=DEFAULT_SYM_CARD, + help="Low-cardinality SYMBOL `sym` (both schemas).") + parser.add_argument( + "--varchar-len", type=int, default=DEFAULT_VARCHAR_LEN) + parser.add_argument( + "--varchar-charset", choices=["ascii", "unicode"], default="ascii", + help="VARCHAR note content charset (unicode = non-ASCII codepoints, " + "defeats the numpy ASCII fast path; ~2x on-wire bytes).") + parser.add_argument( + "--hi-sym-card", type=int, default=DEFAULT_HI_SYM_CARD, + help="s2-wide high-cardinality SYMBOLs s1..s5 (uniform; default 100k " + "= the Go qwp-egress-read-wide anchor).") + parser.add_argument( + "--run-mode", choices=["quick", "full"], default="full") + parser.add_argument( + "--path", + choices=ALL_PATHS, + action="append", + help="Egress path(s) to run. Defaults to all paths.") + parser.add_argument("--pretty", action="store_true") + args = parser.parse_args() + + report = run_layer3(args) + print(json.dumps( + report, indent=2 if args.pretty else None, sort_keys=True)) + + +if __name__ == "__main__": + main() diff --git a/test/system_test.py b/test/system_test.py index 09acfd4d..9b584129 100755 --- a/test/system_test.py +++ b/test/system_test.py @@ -3,7 +3,13 @@ import sys sys.dont_write_bytecode = True import os +import datetime +import importlib.util +import random import shutil +import socket +import tempfile +import threading import unittest import uuid import pathlib @@ -29,11 +35,12 @@ import questdb.ingress as qi -QUESTDB_VERSION = '9.2.0' +QUESTDB_VERSION = '9.4.3' QUESTDB_PLAIN_INSTALL_PATH = None QUESTDB_AUTH_INSTALL_PATH = None FIRST_ARRAY_RELEASE = (8, 4, 0) FIRST_DECIMAL_RELEASE = (9, 2, 0) +FIRST_QWP_WS_RELEASE = (9, 4, 3) def may_install_questdb(): global QUESTDB_PLAIN_INSTALL_PATH @@ -69,9 +76,14 @@ def setUpClass(cls): cls.qdb_plain = None cls.qdb_auth = None + cls._qwp_udp_enabled = bool(os.environ.get('QDB_REPO_PATH')) 
cls.qdb_plain = QuestDbFixture( - QUESTDB_PLAIN_INSTALL_PATH, auth=False, wrap_tls=True, http=True) + QUESTDB_PLAIN_INSTALL_PATH, + auth=False, + wrap_tls=True, + http=True, + qwp_udp=cls._qwp_udp_enabled) cls.qdb_plain.start() cls.qdb_auth = QuestDbFixture( @@ -85,6 +97,95 @@ def tearDownClass(cls): if cls.qdb_plain: cls.qdb_plain.stop() + def _require_qwp_udp(self): + if not self.qdb_plain.qwp_udp: + self.skipTest( + 'QWP/UDP integration tests require repo-backed QWP receiver support') + + def _mk_qwpudp_sender(self, **kwargs): + self._require_qwp_udp() + return qi.Sender( + qi.Protocol.QwpUdp, + self.qdb_plain.host, + self.qdb_plain.qwp_udp_port, + **kwargs) + + def _mk_qwpudp_conf(self, **kwargs): + self._require_qwp_udp() + conf = f'qwpudp::addr={self.qdb_plain.host}:{self.qdb_plain.qwp_udp_port};' + for key, value in kwargs.items(): + conf += f'{key}={value};' + return conf + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests require QuestDB 9.4.3+') + + def _require_qwp_fuzz(self): + self._require_qwp_ws() + + def _mk_qwpws_conf(self, sender_id, sf_dir, endpoints=None, **kwargs): + self._require_qwp_ws() + if endpoints is None: + endpoints = [ + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + addr = ','.join( + f'{endpoint_host}:{endpoint_port}' + for endpoint_host, endpoint_port in endpoints) + conf = ( + f'qwpws::addr={addr};' + f'sender_id={sender_id};' + f'sf_dir={sf_dir};') + for key, value in kwargs.items(): + conf += f'{key}={value};' + return conf + + @staticmethod + def _micros_to_qdb_date(timestamp_us): + secs, remaining_us = divmod(timestamp_us, 1_000_000) + return datetime.datetime.fromtimestamp( + secs, datetime.timezone.utc).replace( + microsecond=remaining_us).strftime('%Y-%m-%dT%H:%M:%S.%fZ') + + @staticmethod + def _nanos_to_qdb_date(timestamp_ns): + secs, remaining_ns = divmod(timestamp_ns, 1_000_000_000) + base = datetime.datetime.fromtimestamp( + 
secs, datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%S') + return f'{base}.{remaining_ns:09d}Z' + + @staticmethod + def _sfa_file_count(sf_dir, sender_id): + slot_dir = pathlib.Path(sf_dir) / sender_id + if not slot_dir.exists(): + return 0 + return sum(1 for path in slot_dir.iterdir() + if path.name.endswith('.sfa')) + + @staticmethod + def _unused_tcp_port(): + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock: + sock.bind(('127.0.0.1', 0)) + return sock.getsockname()[1] + + def _retry_poll_qwp_ws_error(self, sender, timeout_sec=10): + import time as _time + deadline = _time.monotonic() + timeout_sec + while _time.monotonic() < deadline: + diagnostic = sender.poll_qwp_ws_error() + if diagnostic is not None: + return diagnostic + _time.sleep(0.05) + self.fail('Timed out waiting for QWP/WebSocket diagnostic') + + @staticmethod + def _qwp_fuzz_seed(): + seed_text = os.environ.get('QDB_PY_QWP_FUZZ_SEED') + if seed_text: + return int(seed_text, 0) + return 0x5151 + def _test_scenario(self, qdb, protocol, **kwargs): protocol = qi.Protocol.parse(protocol) port = qdb.tls_line_tcp_port if protocol.tls_enabled else qdb.line_tcp_port @@ -233,97 +334,4484 @@ def test_http(self): scrubbed_dataset = [row[:-1] for row in resp['dataset']] self.assertEqual(scrubbed_dataset, exp_dataset) - def test_f64_arr(self): - if self.qdb_plain.version < FIRST_ARRAY_RELEASE: - self.skipTest('old server does not support array') + def test_qwp_websocket_single_batch_round_trip(self): + self._require_qwp_ws() table_name = uuid.uuid4().hex - array1 = np.array( - [ - [[1.1, 2.2], [3.3, 4.4]], - [[5.5, 6.6], [7.7, 8.8]] - ], - dtype=np.float64 - ) - array2 = array1.T - array3 = array1[::-1, ::-1] - with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + sender_id = 'py-smoke-' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE "{table_name}" ' + '(id LONG, val DOUBLE, timestamp TIMESTAMP) ' + 'TIMESTAMP(timestamp) PARTITION BY DAY WAL ' 
+ 'DEDUP UPSERT KEYS(timestamp, id)') + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-smoke-') as sf_dir: + conf = self._mk_qwpws_conf( + sender_id, + sf_dir, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000) + sender = qi.Sender.from_conf(conf) + try: + sender.establish() + for row_id in range(3): + sender.row( + table_name, + columns={ + 'id': row_id, + 'val': row_id * 0.5}, + at=qi.TimestampMicros( + 1_700_000_000_000_000 + row_id * 1000)) + fsn = sender.flush_and_get_fsn() + self.assertEqual(fsn, 0) + self.assertTrue(sender.await_acked_fsn(fsn, 30000)) + sender.close_drain() + finally: + sender.close(False) + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table_name, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f"select id, val from '{table_name}' order by id") + self.assertEqual(resp['dataset'], [[0, 0.0], [1, 0.5], [2, 1.0]]) + + def test_qwp_websocket_dead_endpoint_failover_and_ack_progresses(self): + self._require_qwp_ws() + table_name = uuid.uuid4().hex + sender_id = 'py-failover-' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE "{table_name}" ' + '(id LONG, val DOUBLE, timestamp TIMESTAMP) ' + 'TIMESTAMP(timestamp) PARTITION BY DAY WAL ' + 'DEDUP UPSERT KEYS(timestamp, id)') + endpoints = [ + (self.qdb_plain.host, self._unused_tcp_port()), + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-failover-') as sf_dir: + sender = qi.Sender.from_conf(self._mk_qwpws_conf( + sender_id, + sf_dir, + endpoints=endpoints, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000)) + try: + sender.establish() + sender.row( + table_name, + columns={'id': 0, 'val': 0.5}, + at=qi.TimestampMicros(1_700_000_000_000_000)) + fsn = sender.flush_and_get_fsn() + self.assertEqual(fsn, 0) + self.assertTrue(sender.await_acked_fsn(fsn, 30000)) + self.assertEqual(sender.acked_fsn(), fsn) + 
sender.close_drain() + finally: + sender.close(False) + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table_name, min_rows=1) + resp = self.qdb_plain.http_sql_query( + f"select id, val from '{table_name}'") + self.assertEqual(resp['dataset'], [[0, 0.5]]) + + def test_qwp_websocket_schema_evolution_across_batches(self): + self._require_qwp_ws() + table_name = uuid.uuid4().hex + sender_id = 'py-schema-' + uuid.uuid4().hex[:8] + + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-schema-') as sf_dir: + sender = qi.Sender.from_conf(self._mk_qwpws_conf( + sender_id, + sf_dir, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000)) + try: + sender.establish() + sender.row( + table_name, + symbols={'host': 'r1'}, + at=qi.TimestampMicros(1_700_000_000_000_000)) + first_fsn = sender.flush_and_get_fsn() + self.assertEqual(first_fsn, 0) + + sender.row( + table_name, + symbols={'host': 'r2'}, + columns={'qty': 2, 'note': 'two'}, + at=qi.TimestampMicros(1_700_000_000_001_000)) + second_fsn = sender.flush_and_get_fsn() + self.assertEqual(second_fsn, 1) + + sender.row( + table_name, + symbols={'host': 'r3'}, + columns={'note': 'three'}, + at=qi.TimestampMicros(1_700_000_000_002_000)) + third_fsn = sender.flush_and_get_fsn() + self.assertEqual(third_fsn, 2) + + self.assertTrue(sender.await_acked_fsn(third_fsn, 30000)) + sender.close_drain() + finally: + sender.close(False) + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table_name, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f"select host, qty, note from '{table_name}' order by host") + self.assertEqual(resp['dataset'], [ + ['r1', None, None], + ['r2', 2, 'two'], + ['r3', None, 'three']]) + + def test_qwp_websocket_write_rejection_drops_and_sender_continues(self): + self._require_qwp_ws() + table_name = uuid.uuid4().hex + sender_id = 'py-reject-' + uuid.uuid4().hex[:8] + 
self.qdb_plain.http_sql_query( + f'CREATE TABLE "{table_name}" ' + '(id LONG, px DOUBLE, bad LONG, timestamp TIMESTAMP) ' + 'TIMESTAMP(timestamp) PARTITION BY DAY WAL') + + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-reject-') as sf_dir: + sender = qi.Sender.from_conf(self._mk_qwpws_conf( + sender_id, + sf_dir, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000)) + try: + sender.establish() + sender.row( + table_name, + columns={'id': 0, 'px': 10.5}, + at=qi.TimestampMicros(1_700_000_000_000_000)) + first_fsn = sender.flush_and_get_fsn() + + sender.row( + table_name, + columns={'id': 1, 'bad': 'not-a-long'}, + at=qi.TimestampMicros(1_700_000_000_001_000)) + rejected_fsn = sender.flush_and_get_fsn() + + sender.row( + table_name, + columns={'id': 2, 'px': 20.5}, + at=qi.TimestampMicros(1_700_000_000_002_000)) + final_fsn = sender.flush_and_get_fsn() + + self.assertEqual( + (first_fsn, rejected_fsn, final_fsn), + (0, 1, 2)) + self.assertTrue(sender.await_acked_fsn(final_fsn, 30000)) + diagnostic = self._retry_poll_qwp_ws_error(sender) + self.assertEqual( + diagnostic.category, + qi.QwpWsErrorCategory.SchemaMismatch) + self.assertEqual( + diagnostic.applied_policy, + qi.QwpWsErrorPolicy.DropAndContinue) + self.assertEqual(diagnostic.status, 0x03) + self.assertEqual(diagnostic.from_fsn, rejected_fsn) + self.assertEqual(diagnostic.to_fsn, rejected_fsn) + self.assertIsNone(sender.poll_qwp_ws_error()) + self.assertEqual(sender.qwp_ws_errors_dropped(), 0) + sender.close_drain() + finally: + sender.close(False) + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table_name, min_rows=2) + resp = self.qdb_plain.http_sql_query( + f"select id, px from '{table_name}' order by id") + self.assertEqual(resp['dataset'], [[0, 10.5], [2, 20.5]]) + + def test_qwp_websocket_error_handler_callback_fires(self): + self._require_qwp_ws() + table_name = uuid.uuid4().hex + sender_id = 'py-reject-cb-' + 
uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE "{table_name}" ' + '(id LONG, px DOUBLE, bad LONG, timestamp TIMESTAMP) ' + 'TIMESTAMP(timestamp) PARTITION BY DAY WAL') + + captured = [] + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-reject-cb-') as sf_dir: + sender = qi.Sender.from_conf( + self._mk_qwpws_conf( + sender_id, + sf_dir, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000), + qwp_ws_error_handler=captured.append) + try: + sender.establish() + sender.row( + table_name, + columns={'id': 0, 'px': 10.5}, + at=qi.TimestampMicros(1_700_000_000_000_000)) + sender.flush_and_get_fsn() + sender.row( + table_name, + columns={'id': 1, 'bad': 'not-a-long'}, + at=qi.TimestampMicros(1_700_000_000_001_000)) + rejected_fsn = sender.flush_and_get_fsn() + sender.row( + table_name, + columns={'id': 2, 'px': 20.5}, + at=qi.TimestampMicros(1_700_000_000_002_000)) + final_fsn = sender.flush_and_get_fsn() + self.assertTrue(sender.await_acked_fsn(final_fsn, 30000)) + sender.close_drain() + self.assertTrue( + captured, 'qwp_ws_error_handler was never invoked') + diagnostic = captured[0] + self.assertEqual( + diagnostic.category, + qi.QwpWsErrorCategory.SchemaMismatch) + self.assertEqual(diagnostic.from_fsn, rejected_fsn) + self.assertEqual(diagnostic.to_fsn, rejected_fsn) + finally: + sender.close(False) + + self.qdb_plain.retry_check_table(table_name, min_rows=2) + resp = self.qdb_plain.http_sql_query( + f"select id, px from '{table_name}' order by id") + self.assertEqual(resp['dataset'], [[0, 10.5], [2, 20.5]]) + + def test_qwp_websocket_schema_fuzz(self): + self._require_qwp_fuzz() + seed = self._qwp_fuzz_seed() + rng = random.Random(seed) + sys.stderr.write(f'[qwp-python-fuzz seed] {seed:#x}\n') + sys.stderr.flush() + + rows = int(os.environ.get('QDB_PY_QWP_FUZZ_ROWS', '64')) + rows = max(8, rows) + table_count = int(os.environ.get('QDB_PY_QWP_FUZZ_TABLES', '2')) + table_count = max(1, table_count) + tables = [ + 
'py_qwp_fuzz_' + uuid.uuid4().hex[:8] + for _ in range(table_count)] + expected = {table: [] for table in tables} + sender_id = 'py-fuzz-' + uuid.uuid4().hex[:8] + base_ts = 1_700_000_100_000_000 + host_values = ['alpha', 'beta value', 'Zürich', '東京'] + region_values = ['eu', 'us west', 'apac', 'münchen'] + note_values = ['plain', 'two words', '你好世界', 'emoji-🚀'] + + def append_row(sender, table, row_id, include_all=False): + row_ts = base_ts + row_id + row = { + 'id': row_id, + 'host': None, + 'region': None, + 'qty': None, + 'px': None, + 'note': None, + 'event_ts': None, + 'timestamp': self._micros_to_qdb_date(row_ts)} + symbols = {} + columns = {'id': row_id} + + if include_all or rng.randrange(4) != 0: + value = rng.choice(host_values) + symbols['host'] = value + row['host'] = value + if include_all or rng.randrange(2) == 0: + value = rng.choice(region_values) + symbols['region'] = value + row['region'] = value + + candidates = [ + ('qty', lambda: rng.randrange(-1000, 1000)), + ('px', lambda: round(rng.uniform(-1000.0, 1000.0), 6)), + ('note', lambda: rng.choice(note_values) + f'-{row_id}'), + ('event_ts', lambda: qi.TimestampMicros(row_ts + 123))] + rng.shuffle(candidates) + for name, value_factory in candidates: + if include_all or rng.randrange(3) != 0: + value = value_factory() + columns[name] = value + row[name] = ( + self._micros_to_qdb_date(value.value) + if isinstance(value, qi.TimestampMicros) + else value) + + sender.row( + table, + symbols=symbols, + columns=columns, + at=qi.TimestampMicros(row_ts)) + expected[table].append(row) + + with tempfile.TemporaryDirectory(prefix='py-qwp-ws-fuzz-') as sf_dir: + sender = qi.Sender.from_conf(self._mk_qwpws_conf( + sender_id, + sf_dir, + reconnect_max_duration_millis=30000, + close_flush_timeout_millis=30000)) + last_fsn = None + pending = 0 + try: + sender.establish() + next_flush_at = rng.randrange(3, 11) + for row_id in range(rows): + table = ( + tables[row_id % table_count] + if row_id < table_count + else 
rng.choice(tables)) + append_row(sender, table, row_id) + pending += 1 + if pending >= next_flush_at: + fsn = sender.flush_and_get_fsn() + self.assertIsNotNone(fsn) + if last_fsn is not None: + self.assertEqual(fsn, last_fsn + 1) + last_fsn = fsn + pending = 0 + next_flush_at = rng.randrange(3, 11) + + for table in tables: + append_row( + sender, + table, + rows + tables.index(table), + include_all=True) + pending += 1 + + if pending: + fsn = sender.flush_and_get_fsn() + self.assertIsNotNone(fsn) + if last_fsn is not None: + self.assertEqual(fsn, last_fsn + 1) + last_fsn = fsn + + self.assertIsNotNone(last_fsn) + self.assertTrue(sender.await_acked_fsn(last_fsn, 30000)) + self.assertIsNone(sender.poll_qwp_ws_error()) + self.assertEqual(sender.qwp_ws_errors_dropped(), 0) + sender.close_drain() + finally: + sender.close(False) + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + for table in tables: + self.qdb_plain.retry_check_table( + table, + min_rows=len(expected[table])) + resp = self.qdb_plain.http_sql_query( + f"select id, host, region, qty, px, note, event_ts, timestamp " + f"from '{table}' order by id") + expected_rows = [ + [ + row['id'], + row['host'], + row['region'], + row['qty'], + row['px'], + row['note'], + row['event_ts'], + row['timestamp']] + for row in sorted(expected[table], key=lambda item: item['id'])] + self.assertEqual(resp['dataset'], expected_rows) + + def test_qwp_udp_protocol_enum(self): + self.assertEqual(qi.Protocol.parse('qwpudp'), qi.Protocol.QwpUdp) + self.assertFalse(qi.Protocol.QwpUdp.tls_enabled) + + def test_qwp_udp_basic(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: sender.row( table_name, - columns={ - 'f64_arr1': array1, - 'f64_arr2': array2, - 'f64_arr3': array3}, + symbols={'name_a': 'val_a'}, + columns={'name_b': True, 'name_c': 42, 'name_d': 2.5}, at=qi.ServerTimestamp) - resp = self.qdb_plain.retry_check_table(table_name) - exp_columns = [{'dim': 
3, 'elemType': 'DOUBLE', 'name': 'f64_arr1', 'type': 'ARRAY'}, - {'dim': 3, 'elemType': 'DOUBLE', 'name': 'f64_arr2', 'type': 'ARRAY'}, - {'dim': 3, 'elemType': 'DOUBLE', 'name': 'f64_arr3', 'type': 'ARRAY'}, - {'name': 'timestamp', 'type': 'TIMESTAMP'}] - self.assertEqual(resp['columns'], exp_columns) - expected_data = [[[[[1.1, 2.2], [3.3, 4.4]], [[5.5, 6.6], [7.7, 8.8]]], - [[[1.1, 5.5], [3.3, 7.7]], [[2.2, 6.6], [4.4, 8.8]]], - [[[7.7, 8.8], [5.5, 6.6]], [[3.3, 4.4], [1.1, 2.2]]]]] - scrubbed_data = [row[:-1] for row in resp['dataset']] - self.assertEqual(scrubbed_data, expected_data) + self.assertEqual(bytes(sender), b'') + self.assertGreater(len(sender), 0) + sender.flush() + self.assertEqual(len(sender), 0) - def test_decimal_py_obj(self): - if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: - self.skipTest('old server does not support decimal') + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + exp_columns = [ + {'name': 'name_a', 'type': 'SYMBOL'}, + {'name': 'name_b', 'type': 'BOOLEAN'}, + {'name': 'name_c', 'type': 'LONG'}, + {'name': 'name_d', 'type': 'DOUBLE'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [['val_a', True, 42, 2.5]]) + def test_qwp_udp_from_conf_with_opts(self): + self._require_qwp_udp() table_name = uuid.uuid4().hex - self.qdb_plain.http_sql_query(f'CREATE TABLE {table_name} (dec_col DECIMAL(18,3), timestamp TIMESTAMP) TIMESTAMP(timestamp) PARTITION BY DAY;') + conf = self._mk_qwpudp_conf(max_datagram_size=1200, multicast_ttl=2) + with qi.Sender.from_conf(conf) as sender: + self.assertEqual(sender.auto_flush_bytes, 1200) + sender.row( + table_name, + columns={'price': 1.5}, + at=qi.ServerTimestamp) - pending = None - with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + 
scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [[1.5]]) + + def test_qwp_udp_from_conf_override(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + conf = self._mk_qwpudp_conf() + with qi.Sender.from_conf( + conf, + max_datagram_size=1200, + multicast_ttl=2) as sender: + self.assertEqual(sender.auto_flush_bytes, 1200) sender.row( table_name, - columns={ - 'dec_col': decimal.Decimal('12345.678')}, + columns={'price': 2.5}, at=qi.ServerTimestamp) - pending = bytes(sender) - - resp = self.qdb_plain.retry_check_table(table_name, min_rows=1, log_ctx=pending) - exp_columns = [{'name': 'dec_col', 'type': 'DECIMAL(18,3)'}, - {'name': 'timestamp', 'type': 'TIMESTAMP'}] - self.assertEqual(resp['columns'], exp_columns) - expected_data = [['12345.678']] - scrubbed_data = [row[:-1] for row in resp['dataset']] - self.assertEqual(scrubbed_data, expected_data) - @unittest.skipIf(not pyarrow, 'pyarrow not installed') - @unittest.skipIf(not pd, 'pandas not installed') - def test_decimal_pyarrow(self): - if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: - self.skipTest('old server does not support decimal') + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [[2.5]]) + def test_qwp_udp_from_env_override(self): + self._require_qwp_udp() table_name = uuid.uuid4().hex - self.qdb_plain.http_sql_query(f'CREATE TABLE {table_name} (prices DECIMAL(18,3), timestamp TIMESTAMP) TIMESTAMP(timestamp) PARTITION BY DAY;') + old_conf = os.environ.get('QDB_CLIENT_CONF') + os.environ['QDB_CLIENT_CONF'] = self._mk_qwpudp_conf() + try: + with qi.Sender.from_env( + max_datagram_size=1200, + multicast_ttl=2) as sender: + self.assertEqual(sender.auto_flush_bytes, 1200) + sender.row( + table_name, + columns={'price': 4.5}, + at=qi.ServerTimestamp) + finally: + if old_conf is None: + del os.environ['QDB_CLIENT_CONF'] + else: + 
os.environ['QDB_CLIENT_CONF'] = old_conf + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [[4.5]]) + + def test_qwp_udp_from_conf_override_conflict(self): + self._require_qwp_udp() + conf = self._mk_qwpudp_conf(max_datagram_size=1200) + with self.assertRaisesRegex( + ValueError, + r'"max_datagram_size" is already present in the conf_str'): + qi.Sender.from_conf(conf, max_datagram_size=900) + + def test_qwp_udp_auto_flush_bytes_default(self): + self._require_qwp_udp() + sender = self._mk_qwpudp_sender() + try: + self.assertTrue(sender.auto_flush) + self.assertEqual(sender.auto_flush_bytes, 1400) + finally: + sender.close(flush=False) + + sender = self._mk_qwpudp_sender(max_datagram_size=1200) + try: + self.assertEqual(sender.auto_flush_bytes, 1200) + finally: + sender.close(flush=False) + + def test_qwp_udp_new_buffer(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender(init_buf_size=1024, max_name_len=64) as sender: + buffer = sender.new_buffer() + self.assertEqual(buffer.init_buf_size, 1024) + self.assertEqual(buffer.max_name_len, 64) + buffer.row( + table_name, + columns={'price': 3.5}, + at=qi.ServerTimestamp) + self.assertEqual(bytes(buffer), b'') + self.assertGreater(len(buffer), 0) + sender.flush(buffer) + self.assertEqual(len(buffer), 0) + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [[3.5]]) + + def test_qwp_udp_new_buffer_requires_establish(self): + self._require_qwp_udp() + sender = self._mk_qwpudp_sender() + try: + with self.assertRaisesRegex( + qi.IngressError, + r"new_buffer\(\) can't be called before establish\(\)"): + sender.new_buffer() + finally: + sender.close(flush=False) + + def test_qwp_udp_new_buffer_rejects_closed_sender(self): + self._require_qwp_udp() + sender = 
self._mk_qwpudp_sender() + sender.close(flush=False) + with self.assertRaisesRegex( + qi.IngressError, + r"new_buffer\(\) can't be called: Sender is closed"): + sender.new_buffer() + + def test_qwp_udp_transaction_rejected(self): + self._require_qwp_udp() + with self._mk_qwpudp_sender() as sender: + with self.assertRaisesRegex( + qi.IngressError, + 'Transactions are only supported for ILP/HTTP'): + sender.transaction('trades') + + def test_qwp_udp_protocol_version_rejected(self): + self._require_qwp_udp() + with self._mk_qwpudp_sender() as sender: + with self.assertRaisesRegex( + qi.IngressError, + 'protocol_version is not applicable for QWP/UDP senders'): + sender.protocol_version + + def test_qwp_udp_example(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + example_path = PROJ_ROOT / 'examples' / 'qwp_udp.py' + spec = importlib.util.spec_from_file_location( + 'questdb_qwp_udp_example', + example_path) + self.assertIsNotNone(spec) + self.assertIsNotNone(spec.loader) + mod = importlib.util.module_from_spec(spec) + spec.loader.exec_module(mod) + + mod.example( + host=self.qdb_plain.host, + port=self.qdb_plain.qwp_udp_port, + table_name=table_name) + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [['ETH-USD', 'sell', 2615.54, 0.00044]]) + @unittest.skipIf(not pd, 'pandas not installed') + def test_qwp_udp_dataframe(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex df = pd.DataFrame({ - 'prices': pd.array( - [ - decimal.Decimal('-99999.99'), - decimal.Decimal('-678'), - ], - dtype=pd.ArrowDtype(pyarrow.decimal128(18, 2)) - ) + 'name_a': ['a', 'b'], + 'name_b': [True, False], + 'name_c': [1, 2], + 'name_d': [1.5, 2.5], }) - - pending = None - with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + with self._mk_qwpudp_sender() as sender: sender.dataframe(df, table_name=table_name, 
at=qi.ServerTimestamp) - pending = bytes(sender) - resp = self.qdb_plain.retry_check_table(table_name, min_rows=2, log_ctx=pending) - exp_columns = [{'name': 'prices', 'type': 'DECIMAL(18,3)'}, - {'name': 'timestamp', 'type': 'TIMESTAMP'}] + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + exp_columns = [ + {'name': 'name_a', 'type': 'VARCHAR'}, + {'name': 'name_b', 'type': 'BOOLEAN'}, + {'name': 'name_c', 'type': 'LONG'}, + {'name': 'name_d', 'type': 'DOUBLE'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] self.assertEqual(resp['columns'], exp_columns) - expected_data = [ - ['-99999.990'], - ['-678.000'], - ] - scrubbed_data = [row[:-1] for row in resp['dataset']] - self.assertEqual(scrubbed_data, expected_data) + scrubbed_dataset = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_dataset, [['a', True, 1, 1.5], ['b', False, 2, 2.5]]) + + def test_qwp_udp_timestamp_columns(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + ts_micros = qi.TimestampMicros(1_700_000_000_000_000) + ts_nanos = qi.TimestampNanos(1_700_000_000_123_456_789) + dt = datetime.datetime(2024, 6, 15, 12, 0, 0, tzinfo=datetime.timezone.utc) + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + columns={ + 'ts_micros': ts_micros, + 'ts_nanos': ts_nanos, + 'ts_dt': dt}, + at=qi.TimestampNanos.now()) + sender.flush() + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + col_types = {c['name']: c['type'] for c in resp['columns']} + self.assertEqual(col_types['ts_micros'], 'TIMESTAMP') + self.assertEqual(col_types['ts_nanos'], 'TIMESTAMP_NS') + self.assertEqual(col_types['ts_dt'], 'TIMESTAMP') + row = resp['dataset'][0] + # ts_micros: 1_700_000_000_000_000 micros + self.assertEqual(row[0], '2023-11-14T22:13:20.000000Z') + # ts_dt: 2024-06-15T12:00:00Z + self.assertEqual(row[2], '2024-06-15T12:00:00.000000Z') + + def test_qwp_udp_timestamp_columns_convert_into_existing_table_types(self): + self._require_qwp_udp() + 
micros_table = uuid.uuid4().hex + nanos_table = uuid.uuid4().hex + event_ts_us = 123_456 + event_ts_ns = 123_456_789 + row_ts_us = 1_700_000_000_000_123 + self.qdb_plain.http_sql_query( + f'CREATE TABLE {micros_table} ' + f'(host SYMBOL, event_ts TIMESTAMP_NS, timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + self.qdb_plain.http_sql_query( + f'CREATE TABLE {nanos_table} ' + f'(host SYMBOL, event_ts TIMESTAMP, timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + + with self._mk_qwpudp_sender() as sender: + sender.row( + micros_table, + symbols={'host': 'micro'}, + columns={'event_ts': qi.TimestampMicros(event_ts_us)}, + at=qi.TimestampMicros(row_ts_us)) + sender.row( + nanos_table, + symbols={'host': 'nano'}, + columns={'event_ts': qi.TimestampNanos(event_ts_ns)}, + at=qi.TimestampMicros(row_ts_us)) + sender.flush() + + self.qdb_plain.retry_check_table(micros_table, min_rows=1) + self.qdb_plain.retry_check_table(nanos_table, min_rows=1) + micros_resp = self.qdb_plain.http_sql_query( + f"select host, event_ts, timestamp from '{micros_table}'") + nanos_resp = self.qdb_plain.http_sql_query( + f"select host, event_ts, timestamp from '{nanos_table}'") + self.assertEqual(micros_resp['dataset'], [[ + 'micro', + self._nanos_to_qdb_date(event_ts_us * 1000), + self._micros_to_qdb_date(row_ts_us)]]) + self.assertEqual(nanos_resp['dataset'], [[ + 'nano', + self._micros_to_qdb_date(event_ts_ns // 1000), + self._micros_to_qdb_date(row_ts_us)]]) + + def test_qwp_udp_mixed_timestamp_precisions_rejected(self): + self._require_qwp_udp() + with self.assertRaisesRegex( + qi.IngressError, + 'designated timestamp changes type within a batched table'): + with self._mk_qwpudp_sender() as sender: + sender.row( + 'mixed_ts_designated', + columns={'qty': 1}, + at=qi.TimestampMicros(123_456)) + sender.row( + 'mixed_ts_designated', + columns={'qty': 2}, + at=qi.TimestampNanos(789_000)) + sender.flush() + + with self.assertRaisesRegex( + qi.IngressError, + 
'column "event_ts" changes type within a batched table'): + with self._mk_qwpudp_sender() as sender: + sender.row( + 'mixed_ts_column', + columns={'event_ts': qi.TimestampMicros(123_456)}, + at=qi.ServerTimestamp) + sender.row( + 'mixed_ts_column', + columns={'event_ts': qi.TimestampNanos(789_000)}, + at=qi.ServerTimestamp) + sender.flush() + + def test_qwp_udp_f64_array(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_ARRAY_RELEASE: + self.skipTest('old server does not support array') + table_name = uuid.uuid4().hex + array1 = np.array([[1.1, 2.2], [3.3, 4.4]], dtype=np.float64) + array2 = array1.T # non-contiguous + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + columns={ + 'arr_c': array1, + 'arr_t': array2}, + at=qi.TimestampNanos.now()) + sender.flush() + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + col_types = {c['name']: c['type'] for c in resp['columns']} + self.assertEqual(col_types['arr_c'], 'ARRAY') + self.assertEqual(col_types['arr_t'], 'ARRAY') + scrubbed = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed, [[[[1.1, 2.2], [3.3, 4.4]], + [[1.1, 3.3], [2.2, 4.4]]]]) + + def test_qwp_udp_decimal(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(price DECIMAL(18,3), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + columns={'price': decimal.Decimal('12345.678')}, + at=qi.TimestampNanos.now()) + sender.flush() + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + exp_columns = [ + {'name': 'price', 'type': 'DECIMAL(18,3)'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + scrubbed = [row[:-1] for row in resp['dataset']] + 
self.assertEqual(scrubbed, [['12345.678']]) + + def test_qwp_udp_string_column(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + columns={'label': 'hello world', 'value': 42}, + at=qi.TimestampNanos.now()) + sender.flush() + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + col_types = {c['name']: c['type'] for c in resp['columns']} + self.assertEqual(col_types['label'], 'VARCHAR') + self.assertEqual(col_types['value'], 'LONG') + scrubbed = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed, [['hello world', 42]]) + + def test_qwp_udp_auto_flush_bytes_triggers(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender( + max_datagram_size=200, + auto_flush_rows=False, + auto_flush_interval=False) as sender: + self.assertEqual(sender.auto_flush_bytes, 200) + for i in range(20): + sender.row( + table_name, + symbols={'tag': f'v_{i}'}, + columns={'value': i}, + at=qi.TimestampNanos.now()) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=10) + self.assertGreaterEqual(resp['count'], 10) + + def test_qwp_udp_auto_flush_rows_triggers(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender( + auto_flush_rows=5, + auto_flush_bytes=False, + auto_flush_interval=False) as sender: + for i in range(10): + sender.row( + table_name, + columns={'value': i}, + at=qi.TimestampNanos.now()) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=10) + self.assertEqual(resp['count'], 10) + + def test_qwp_udp_auto_flush_disabled(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + sender = self._mk_qwpudp_sender(auto_flush=False) + sender.establish() + try: + for i in range(5): + sender.row( + table_name, + columns={'value': i}, + at=qi.TimestampNanos.now()) + self.assertGreater(len(sender), 0) + sender.flush() + finally: + sender.close(flush=False) + resp = 
self.qdb_plain.retry_check_table(table_name, min_rows=5) + self.assertEqual(resp['count'], 5) + + def test_qwp_udp_multi_table(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row(t1, columns={'x': 1}, at=qi.TimestampNanos.now()) + sender.row(t2, columns={'y': 2}, at=qi.TimestampNanos.now()) + sender.row(t1, columns={'x': 3}, at=qi.TimestampNanos.now()) + sender.flush() + r1 = self.qdb_plain.retry_check_table(t1, min_rows=2) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + self.assertEqual(r1['count'], 2) + self.assertEqual(r2['count'], 1) + + def test_qwp_udp_buffer_reuse_after_flush(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + buf = sender.new_buffer() + buf.row(t1, columns={'batch': 1}, at=qi.TimestampNanos.now()) + sender.flush(buf) + self.assertEqual(len(buf), 0) + buf.row(t2, columns={'batch': 2}, at=qi.TimestampNanos.now()) + sender.flush(buf) + r1 = self.qdb_plain.retry_check_table(t1, min_rows=1) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + self.assertEqual([row[:-1] for row in r1['dataset']], [[1]]) + self.assertEqual([row[:-1] for row in r2['dataset']], [[2]]) + + def test_qwp_udp_independent_buffers(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + buf_a = sender.new_buffer() + buf_b = sender.new_buffer() + buf_a.row(t1, columns={'src': 'a'}, at=qi.TimestampNanos.now()) + buf_b.row(t2, columns={'src': 'b'}, at=qi.TimestampNanos.now()) + sender.flush(buf_a) + self.assertEqual(len(buf_a), 0) + self.assertGreater(len(buf_b), 0) + sender.flush(buf_b) + r1 = self.qdb_plain.retry_check_table(t1, min_rows=1) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + self.assertEqual([row[:-1] for row in r1['dataset']], [['a']]) + self.assertEqual([row[:-1] for row in r2['dataset']], [['b']]) + + def 
test_qwp_udp_flush_clear_false(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + row_ts = qi.TimestampMicros(1_700_000_000_200_000) + with self._mk_qwpudp_sender() as sender: + buf = sender.new_buffer() + buf.row(table_name, columns={'val': 99}, at=row_ts) + sender.flush(buf, clear=False) + self.assertGreater(len(buf), 0) + sender.flush(buf) + self.assertEqual(len(buf), 0) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + self.assertEqual(resp['dataset'], [ + [99, self._micros_to_qdb_date(row_ts.value)], + [99, self._micros_to_qdb_date(row_ts.value)]]) + + def test_qwp_udp_unicode(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + symbols={'city': 'Zürich'}, + columns={'greeting': '你好世界', 'emoji': '🚀'}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + row = resp['dataset'][0] + self.assertEqual(row[0], 'Zürich') + self.assertEqual(row[1], '你好世界') + self.assertEqual(row[2], '🚀') + + def test_qwp_udp_none_columns_skipped(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + symbols={'tag': 'a', 'skip_sym': None}, + columns={'present': 42, 'absent': None}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + col_names = [c['name'] for c in resp['columns']] + self.assertIn('present', col_names) + self.assertNotIn('absent', col_names) + self.assertNotIn('skip_sym', col_names) + + def test_qwp_udp_schema_expansion_backfills_rows(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, symbols={'host': 'r1'}, at=qi.ServerTimestamp) + sender.row( + table_name, + symbols={'host': 'r2'}, + columns={'qty': 2, 'note': 'two'}, + at=qi.ServerTimestamp) + sender.row( + 
table_name, + symbols={'host': 'r3'}, + columns={'note': 'three'}, + at=qi.ServerTimestamp) + sender.flush() + + self.qdb_plain.retry_check_table(table_name, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f"select host, qty, note from '{table_name}' order by host") + self.assertEqual(resp['dataset'], [ + ['r1', None, None], + ['r2', 2, 'two'], + ['r3', None, 'three']]) + + def test_qwp_udp_sparse_boolean_columns_fill_false(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, symbols={'host': 'r1'}, at=qi.ServerTimestamp) + sender.row( + table_name, + symbols={'host': 'r2'}, + columns={'active': True}, + at=qi.ServerTimestamp) + sender.row( + table_name, + symbols={'host': 'r3'}, + columns={'active': False}, + at=qi.ServerTimestamp) + sender.flush() + + self.qdb_plain.retry_check_table(table_name, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f"select host, active from '{table_name}' order by host") + self.assertEqual(resp['dataset'], [ + ['r1', False], + ['r2', True], + ['r3', False]]) + + def test_qwp_udp_sparse_numeric_and_timestamp_columns_fill_null(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + event_ts = qi.TimestampMicros(123_456) + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, symbols={'host': 'r1'}, at=qi.ServerTimestamp) + sender.row( + table_name, + symbols={'host': 'r2'}, + columns={'qty': 2, 'event_ts': event_ts}, + at=qi.ServerTimestamp) + sender.row( + table_name, + symbols={'host': 'r3'}, + columns={'temp': 33.5}, + at=qi.ServerTimestamp) + sender.flush() + + self.qdb_plain.retry_check_table(table_name, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f"select host, qty, temp, event_ts from '{table_name}' order by host") + self.assertEqual(resp['dataset'], [ + ['r1', None, None, None], + ['r2', 2, None, self._micros_to_qdb_date(event_ts.value)], + ['r3', None, 33.5, None]]) + + def test_qwp_udp_empty_flush(self): 
+ self._require_qwp_udp() + with self._mk_qwpudp_sender() as sender: + self.assertEqual(len(sender), 0) + sender.flush() + sender.flush() + buf = sender.new_buffer() + sender.flush(buf) + + def test_qwp_udp_double_close(self): + self._require_qwp_udp() + sender = self._mk_qwpudp_sender() + sender.establish() + sender.close(flush=False) + sender.close(flush=False) + + def test_qwp_udp_context_manager_flush_on_exit(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender(auto_flush=False) as sender: + sender.row( + table_name, columns={'val': 7}, + at=qi.TimestampNanos.now()) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + self.assertEqual([row[:-1] for row in resp['dataset']], [[7]]) + + def test_qwp_udp_server_vs_explicit_timestamp(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + explicit_ts = qi.TimestampNanos(1_700_000_000_000_000_000) + with self._mk_qwpudp_sender() as sender: + sender.row(t1, columns={'x': 1}, at=qi.ServerTimestamp) + sender.row(t2, columns={'x': 2}, at=explicit_ts) + sender.flush() + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + ts = r2['dataset'][0][1] + self.assertIn('2023-11-14', ts) + + def test_qwp_udp_many_rows(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + for i in range(500): + sender.row( + table_name, + symbols={'batch': 'stress'}, + columns={'seq': i, 'payload': f'row_{i:04d}'}, + at=qi.TimestampNanos.now()) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=500) + self.assertEqual(resp['count'], 500) + + def test_qwp_udp_max_name_len(self): + self._require_qwp_udp() + with self._mk_qwpudp_sender(max_name_len=20) as sender: + buf = sender.new_buffer() + buf.row('t', columns={'a' * 20: 1}, at=qi.ServerTimestamp) + self.assertGreater(len(buf), 0) + + buf2 = sender.new_buffer() + with self.assertRaises(qi.IngressError): + buf2.row('t', columns={'a' * 21: 1}, 
at=qi.ServerTimestamp) + + def test_qwp_udp_standalone_buffer_reuse(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + buf = qi.Buffer.qwp() + buf.row(t1, columns={'round': 1}, at=qi.TimestampNanos.now()) + with self._mk_qwpudp_sender() as sender: + sender.flush(buf) + self.assertEqual(len(buf), 0) + buf.row(t2, columns={'round': 2}, at=qi.TimestampNanos.now()) + sender.flush(buf) + r1 = self.qdb_plain.retry_check_table(t1, min_rows=1) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + self.assertEqual([row[:-1] for row in r1['dataset']], [[1]]) + self.assertEqual([row[:-1] for row in r2['dataset']], [[2]]) + + def test_qwp_udp_auto_flush_interval(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + import time as _time + with self._mk_qwpudp_sender( + auto_flush_rows=False, + auto_flush_bytes=False, + auto_flush_interval=500) as sender: + sender.row( + table_name, columns={'seq': 1}, + at=qi.TimestampNanos.now()) + self.assertGreater(len(sender), 0) + _time.sleep(0.7) + sender.row( + table_name, columns={'seq': 2}, + at=qi.TimestampNanos.now()) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + self.assertEqual(resp['count'], 2) + + def test_qwp_udp_datagram_splitting(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender( + max_datagram_size=200, + auto_flush=False) as sender: + for i in range(30): + sender.row( + table_name, + symbols={'tag': f'val_{i:03d}'}, + columns={'seq': i, 'data': f'payload_{i:06d}'}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=30) + self.assertEqual(resp['count'], 30) + + def test_qwp_udp_interleave_with_http(self): + self._require_qwp_udp() + t_http = uuid.uuid4().hex + t_qwp = uuid.uuid4().hex + with qi.Sender( + qi.Protocol.Http, self.qdb_plain.host, + self.qdb_plain.http_server_port) as http_sender, \ + self._mk_qwpudp_sender() as qwp_sender: + http_sender.row( 
+ t_http, columns={'src': 'http', 'val': 1}, + at=qi.TimestampNanos.now()) + qwp_sender.row( + t_qwp, columns={'src': 'qwp', 'val': 2}, + at=qi.TimestampNanos.now()) + qwp_sender.flush() + r_http = self.qdb_plain.retry_check_table(t_http, min_rows=1) + r_qwp = self.qdb_plain.retry_check_table(t_qwp, min_rows=1) + self.assertEqual(r_http['dataset'][0][0], 'http') + self.assertEqual(r_qwp['dataset'][0][0], 'qwp') + + def test_qwp_udp_from_env(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + old = os.environ.get('QDB_CLIENT_CONF') + os.environ['QDB_CLIENT_CONF'] = self._mk_qwpudp_conf() + try: + with qi.Sender.from_env() as sender: + sender.row( + table_name, columns={'val': 123}, + at=qi.TimestampNanos.now()) + sender.flush() + finally: + if old is None: + del os.environ['QDB_CLIENT_CONF'] + else: + os.environ['QDB_CLIENT_CONF'] = old + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + self.assertEqual( + [row[:-1] for row in resp['dataset']], [[123]]) + + def test_qwp_udp_sender_reuse(self): + self._require_qwp_udp() + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row(t1, columns={'session': 1}, + at=qi.TimestampNanos.now()) + sender.flush() + with self._mk_qwpudp_sender() as sender: + sender.row(t2, columns={'session': 2}, + at=qi.TimestampNanos.now()) + sender.flush() + r1 = self.qdb_plain.retry_check_table(t1, min_rows=1) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=1) + self.assertEqual([row[:-1] for row in r1['dataset']], [[1]]) + self.assertEqual([row[:-1] for row in r2['dataset']], [[2]]) + + def test_qwp_udp_large_string(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + big_str = 'x' * 1000 + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, columns={'payload': big_str}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + self.assertEqual(resp['dataset'][0][0], 
big_str) + + def test_qwp_udp_symbols_only(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + with self._mk_qwpudp_sender() as sender: + sender.row( + table_name, + symbols={'exchange': 'NYSE', 'ticker': 'AAPL'}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + col_types = {c['name']: c['type'] for c in resp['columns']} + self.assertEqual(col_types['exchange'], 'SYMBOL') + self.assertEqual(col_types['ticker'], 'SYMBOL') + self.assertEqual(resp['dataset'][0][0], 'NYSE') + self.assertEqual(resp['dataset'][0][1], 'AAPL') + + def test_qwp_udp_mixed_timestamps(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + explicit = qi.TimestampNanos(1_700_000_000_000_000_000) + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, columns={'seq': 1}, + at=qi.ServerTimestamp) + sender.row(table_name, columns={'seq': 2}, at=explicit) + sender.row(table_name, columns={'seq': 3}, + at=qi.ServerTimestamp) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=3) + self.assertEqual(resp['count'], 3) + rows = sorted(resp['dataset'], key=lambda row: row[0]) + self.assertIn('2023-11-14', rows[1][1]) + + @unittest.skipIf(not pd, 'pandas not installed') + def test_qwp_udp_dataframe_ts_column(self): + self._require_qwp_udp() + table_name = uuid.uuid4().hex + df = pd.DataFrame({ + 'sensor': ['A', 'B'], + 'temp': [22.5, 23.1], + 'ts': pd.to_datetime( + ['2024-01-01 12:00:00', '2024-01-01 12:01:00'], + utc=True), + }) + with self._mk_qwpudp_sender() as sender: + sender.dataframe(df, table_name=table_name, at='ts') + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + self.assertEqual(resp['count'], 2) + col_names = [c['name'] for c in resp['columns']] + self.assertIn('timestamp', col_names) + for row in resp['dataset']: + self.assertIn('2024-01-01', row[-1]) + + def test_qwp_udp_new_buffer_inherits_settings(self): + self._require_qwp_udp() + with 
self._mk_qwpudp_sender( + init_buf_size=2048, max_name_len=32) as sender: + buf = sender.new_buffer() + self.assertEqual(buf.init_buf_size, 2048) + self.assertEqual(buf.max_name_len, 32) + buf.row('t', columns={'a' * 32: 1}, at=qi.ServerTimestamp) + self.assertGreater(len(buf), 0) + with self.assertRaises(qi.IngressError): + buf.row('t', columns={'a' * 33: 1}, at=qi.ServerTimestamp) + + def test_qwp_udp_ilp_buffer_rejected(self): + self._require_qwp_udp() + buf = qi.Buffer.ilp(protocol_version=2) + buf.row('t', columns={'x': 1}, at=qi.ServerTimestamp) + with self._mk_qwpudp_sender() as sender: + with self.assertRaisesRegex( + qi.IngressError, 'QWP/UDP sender requires a QWP buffer'): + sender.flush(buf) + + def test_qwp_udp_buffer_rejected_by_http(self): + self._require_qwp_udp() + buf = qi.Buffer.qwp() + buf.row('t', columns={'x': 1}, at=qi.ServerTimestamp) + with qi.Sender( + qi.Protocol.Http, self.qdb_plain.host, + self.qdb_plain.http_server_port) as sender: + with self.assertRaisesRegex( + qi.IngressError, + 'ILP sender requires an ILP buffer'): + sender.flush(buf) + + def test_qwp_udp_wrong_port_silent(self): + """UDP flush to wrong port succeeds silently (fire-and-forget).""" + self._require_qwp_udp() + with qi.Sender( + qi.Protocol.QwpUdp, + self.qdb_plain.host, 19007) as sender: + sender.row('t', columns={'x': 1}, at=qi.TimestampNanos.now()) + sender.flush() # no error — data goes nowhere + + def test_qwp_udp_unresolvable_host(self): + """Unresolvable host fails at establish().""" + self._require_qwp_udp() + with self.assertRaisesRegex(qi.IngressError, 'Could not resolve'): + with qi.Sender( + qi.Protocol.QwpUdp, + 'this.host.does.not.exist.invalid', 9007) as sender: + pass + + def test_qwp_udp_wide_row(self): + """50 columns + 5 symbols in a single row.""" + self._require_qwp_udp() + table_name = uuid.uuid4().hex + cols = {f'col_{i:02d}': float(i) for i in range(50)} + syms = {f'sym_{i}': f'val_{i}' for i in range(5)} + with self._mk_qwpudp_sender() as 
sender: + sender.row(table_name, symbols=syms, columns=cols, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + # 50 cols + 5 syms + 1 timestamp = 56 + self.assertEqual(len(resp['columns']), 56) + + def test_qwp_udp_row_ordering(self): + """100 rows with explicit timestamps, split across datagrams.""" + self._require_qwp_udp() + table_name = uuid.uuid4().hex + n = 100 + base_ts = 1_700_000_000_000_000_000 + with self._mk_qwpudp_sender( + max_datagram_size=200, auto_flush=False) as sender: + for i in range(n): + sender.row( + table_name, columns={'seq': i}, + at=qi.TimestampNanos(base_ts + i * 1000)) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=n) + seqs = sorted(row[0] for row in resp['dataset']) + self.assertEqual(seqs, list(range(n))) + + def test_qwp_udp_tiny_datagram_rejected(self): + """max_datagram_size=1: row exceeds datagram, flush errors.""" + self._require_qwp_udp() + with self._mk_qwpudp_sender( + max_datagram_size=1, auto_flush=False) as sender: + sender.row('t', columns={'x': 1}, at=qi.TimestampNanos.now()) + with self.assertRaisesRegex( + qi.IngressError, 'exceeds maximum datagram size'): + sender.flush() + + def test_qwp_udp_rapid_fire_auto_flush(self): + """2000 rows with pure auto-flush, no explicit flush. 
+ UDP may drop datagrams under load, so we accept >= 90% arrival.""" + self._require_qwp_udp() + table_name = uuid.uuid4().hex + n = 2000 + with self._mk_qwpudp_sender() as sender: + for i in range(n): + sender.row( + table_name, columns={'seq': i}, + at=qi.TimestampNanos.now()) + import time + time.sleep(3) + resp = self.qdb_plain.retry_check_table( + table_name, min_rows=int(n * 0.9)) + self.assertGreaterEqual(resp['count'], int(n * 0.9)) + + def test_qwp_udp_protocol_version_in_conf_rejected(self): + self._require_qwp_udp() + conf = self._mk_qwpudp_conf(protocol_version=2) + with self.assertRaisesRegex( + qi.IngressError, + 'protocol_version.*not supported.*QWP'): + qi.Sender.from_conf(conf) + + def test_qwp_udp_concurrent_senders(self): + """Two senders from different threads to the same port.""" + self._require_qwp_udp() + import threading + t1 = uuid.uuid4().hex + t2 = uuid.uuid4().hex + errors = [] + def writer(table, n=50): + try: + with self._mk_qwpudp_sender() as sender: + for i in range(n): + sender.row( + table, columns={'seq': i}, + at=qi.TimestampNanos.now()) + sender.flush() + except Exception as e: + errors.append(e) + th1 = threading.Thread(target=writer, args=(t1,)) + th2 = threading.Thread(target=writer, args=(t2,)) + th1.start() + th2.start() + th1.join() + th2.join() + self.assertEqual(errors, []) + r1 = self.qdb_plain.retry_check_table(t1, min_rows=50) + r2 = self.qdb_plain.retry_check_table(t2, min_rows=50) + self.assertEqual(r1['count'], 50) + self.assertEqual(r2['count'], 50) + + def test_qwp_udp_double_establish_rejected(self): + self._require_qwp_udp() + sender = self._mk_qwpudp_sender() + sender.establish() + try: + with self.assertRaisesRegex( + qi.IngressError, "establish.*can't be called"): + sender.establish() + finally: + sender.close(flush=False) + + def test_qwp_udp_establish_after_close_rejected(self): + self._require_qwp_udp() + sender = self._mk_qwpudp_sender() + sender.establish() + sender.close(flush=False) + with 
self.assertRaisesRegex( + qi.IngressError, "establish.*can't be called"): + sender.establish() + + def test_qwp_udp_decimal_zero_and_negative(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(val DECIMAL(18,3), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, + columns={'val': decimal.Decimal('0.000')}, + at=qi.TimestampNanos.now()) + sender.row(table_name, + columns={'val': decimal.Decimal('-0.000')}, + at=qi.TimestampNanos.now()) + sender.row(table_name, + columns={'val': decimal.Decimal('-123456789.012')}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=3) + vals = [row[0] for row in resp['dataset']] + self.assertEqual(vals[0], '0.000') + self.assertIn(vals[1], ('0.000', '-0.000')) + self.assertEqual(vals[2], '-123456789.012') + + def test_qwp_udp_decimal_max_precision(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(val DECIMAL(18,3), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, + columns={'val': decimal.Decimal('999999999999999.999')}, + at=qi.TimestampNanos.now()) + sender.row(table_name, + columns={'val': decimal.Decimal('0.001')}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + vals = [row[0] for row in resp['dataset']] + self.assertEqual(vals[0], '999999999999999.999') + self.assertEqual(vals[1], '0.001') + + def test_qwp_udp_decimal_nan_inf_rejected(self): + 
self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(val DECIMAL(18,3), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + with self._mk_qwpudp_sender() as sender: + with self.assertRaises(qi.IngressError): + sender.row(table_name, + columns={'val': decimal.Decimal('NaN')}, + at=qi.TimestampNanos.now()) + with self.assertRaises(qi.IngressError): + sender.row(table_name, + columns={'val': decimal.Decimal('Inf')}, + at=qi.TimestampNanos.now()) + + def test_qwp_udp_decimal_multiple_columns(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(price DECIMAL(18,2), fee DECIMAL(18,6), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + with self._mk_qwpudp_sender() as sender: + sender.row(table_name, + columns={ + 'price': decimal.Decimal('199.99'), + 'fee': decimal.Decimal('0.000123')}, + at=qi.TimestampNanos.now()) + sender.flush() + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1) + row = resp['dataset'][0] + self.assertEqual(row[0], '199.99') + self.assertEqual(row[1], '0.000123') + + @unittest.skipIf(not pyarrow, 'pyarrow not installed') + @unittest.skipIf(not pd, 'pandas not installed') + def test_qwp_udp_decimal_pyarrow_nulls(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(val DECIMAL(18,3), seq LONG, timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + df = pd.DataFrame({ + 'val': pd.array( + [decimal.Decimal('1.5'), None, 
decimal.Decimal('3.25')], + dtype=pd.ArrowDtype(pyarrow.decimal128(18, 3))), + 'seq': [1, 2, 3], + }) + with self._mk_qwpudp_sender() as sender: + sender.dataframe(df, table_name=table_name, at=qi.ServerTimestamp) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=3) + vals = [row[0] for row in resp['dataset']] + self.assertIn('1.500', vals) + self.assertIn(None, vals) + self.assertIn('3.250', vals) + + @unittest.skipIf(not pyarrow, 'pyarrow not installed') + @unittest.skipIf(not pd, 'pandas not installed') + def test_qwp_udp_decimal_pyarrow(self): + self._require_qwp_udp() + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table_name} ' + f'(prices DECIMAL(18,3), timestamp TIMESTAMP) ' + f'TIMESTAMP(timestamp) PARTITION BY DAY;') + df = pd.DataFrame({ + 'prices': pd.array( + [ + decimal.Decimal('-99999.99'), + decimal.Decimal('-678'), + ], + dtype=pd.ArrowDtype(pyarrow.decimal128(18, 2)) + ) + }) + with self._mk_qwpudp_sender() as sender: + sender.dataframe(df, table_name=table_name, at=qi.ServerTimestamp) + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2) + exp_columns = [ + {'name': 'prices', 'type': 'DECIMAL(18,3)'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + scrubbed = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed, [['-99999.990'], ['-678.000']]) + + def test_f64_arr(self): + if self.qdb_plain.version < FIRST_ARRAY_RELEASE: + self.skipTest('old server does not support array') + table_name = uuid.uuid4().hex + array1 = np.array( + [ + [[1.1, 2.2], [3.3, 4.4]], + [[5.5, 6.6], [7.7, 8.8]] + ], + dtype=np.float64 + ) + array2 = array1.T + array3 = array1[::-1, ::-1] + with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + sender.row( + table_name, + columns={ + 'f64_arr1': array1, + 'f64_arr2': 
array2, + 'f64_arr3': array3}, + at=qi.ServerTimestamp) + resp = self.qdb_plain.retry_check_table(table_name) + exp_columns = [{'dim': 3, 'elemType': 'DOUBLE', 'name': 'f64_arr1', 'type': 'ARRAY'}, + {'dim': 3, 'elemType': 'DOUBLE', 'name': 'f64_arr2', 'type': 'ARRAY'}, + {'dim': 3, 'elemType': 'DOUBLE', 'name': 'f64_arr3', 'type': 'ARRAY'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + expected_data = [[[[[1.1, 2.2], [3.3, 4.4]], [[5.5, 6.6], [7.7, 8.8]]], + [[[1.1, 5.5], [3.3, 7.7]], [[2.2, 6.6], [4.4, 8.8]]], + [[[7.7, 8.8], [5.5, 6.6]], [[3.3, 4.4], [1.1, 2.2]]]]] + scrubbed_data = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_data, expected_data) + + def test_decimal_py_obj(self): + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query(f'CREATE TABLE {table_name} (dec_col DECIMAL(18,3), timestamp TIMESTAMP) TIMESTAMP(timestamp) PARTITION BY DAY;') + + pending = None + with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + sender.row( + table_name, + columns={ + 'dec_col': decimal.Decimal('12345.678')}, + at=qi.ServerTimestamp) + pending = bytes(sender) + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=1, log_ctx=pending) + exp_columns = [{'name': 'dec_col', 'type': 'DECIMAL(18,3)'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + expected_data = [['12345.678']] + scrubbed_data = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_data, expected_data) + + @unittest.skipIf(not pyarrow, 'pyarrow not installed') + @unittest.skipIf(not pd, 'pandas not installed') + def test_decimal_pyarrow(self): + if self.qdb_plain.version < FIRST_DECIMAL_RELEASE: + self.skipTest('old server does not support decimal') + + table_name = uuid.uuid4().hex + self.qdb_plain.http_sql_query(f'CREATE 
TABLE {table_name} (prices DECIMAL(18,3), timestamp TIMESTAMP) TIMESTAMP(timestamp) PARTITION BY DAY;') + + df = pd.DataFrame({ + 'prices': pd.array( + [ + decimal.Decimal('-99999.99'), + decimal.Decimal('-678'), + ], + dtype=pd.ArrowDtype(pyarrow.decimal128(18, 2)) + ) + }) + + pending = None + with qi.Sender('http', 'localhost', self.qdb_plain.http_server_port) as sender: + sender.dataframe(df, table_name=table_name, at=qi.ServerTimestamp) + pending = bytes(sender) + + resp = self.qdb_plain.retry_check_table(table_name, min_rows=2, log_ctx=pending) + exp_columns = [{'name': 'prices', 'type': 'DECIMAL(18,3)'}, + {'name': 'timestamp', 'type': 'TIMESTAMP'}] + self.assertEqual(resp['columns'], exp_columns) + expected_data = [ + ['-99999.990'], + ['-678.000'], + ] + scrubbed_data = [row[:-1] for row in resp['dataset']] + self.assertEqual(scrubbed_data, expected_data) + + +class TestEgressWithDatabase(unittest.TestCase): + """Live-server coverage for ``Client.query(...)``. + + Reuses ``TestWithDatabase`` fixture setup. The egress reader path + is HTTP/QWP-only; we don't replicate the TLS+auth ingress matrix + since the auth fixture's QWP/HTTP endpoint is unauthenticated + (``http_auth=False``). Conf-string + TLS plumbing for egress is + derived from the ingress side; if it breaks there the existing + ingress matrix catches it. + """ + + @classmethod + def setUpClass(cls): + # Reuse the fixture lifecycle from TestWithDatabase. 
+ TestWithDatabase.setUpClass.__func__(cls) + + @classmethod + def tearDownClass(cls): + TestWithDatabase.tearDownClass.__func__(cls) + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests require QuestDB 9.4.3+') + + def setUp(self): + self._require_qwp_ws() + + def _conf(self): + return (f'qwpws::addr={self.qdb_plain.host}:' + f'{self.qdb_plain.http_server_port};') + + def _exec(self, sql): + return self.qdb_plain.http_sql_query(sql) + + def test_type_coverage_round_trip(self): + """One row, every QuestDB type we can express in SQL, read back + via ``Client.query``. Single WAL apply, one query, per-column + assertions on Arrow dtype and value. + + Decimal / Array are deferred: their SQL literal syntax varies + across QuestDB versions and they're better verified once + ingress writes them too. + """ + import pyarrow as pa + table_name = 't_egress_types_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} (' + 'ts TIMESTAMP, ' + 'b BOOLEAN, by BYTE, sh SHORT, i INT, lg LONG, ' + 'fl FLOAT, db DOUBLE, ' + 'ts_ns TIMESTAMP_NS, dt DATE, ' + 'sym SYMBOL, vc VARCHAR, st STRING, ch CHAR, ' + 'uu UUID, l256 LONG256, ip IPV4, gh GEOHASH(8c)' + ') TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES (" + "'2024-01-01T00:00:00.000000Z', " + "true, 7, 700, 70000, 7000000000, " + "3.5, 6.5, " + "'2024-01-01T00:00:00.123456789Z', " + "'2024-01-02', " + "'AAA', 'varchar-value', 'string-value', 'C', " + "'11111111-2222-3333-4444-555555555555', " + "'0x0001020304050607080910111213141516171819202122232425262728293031', " + "'192.168.1.10', " + "'s00twy01'" + ")") + self.qdb_plain.retry_check_table(table_name, min_rows=1) + + with qi.Client.from_conf(self._conf()) as client: + table = client.query( + f'SELECT * FROM {table_name}').to_arrow() + + self.assertEqual(table.num_rows, 1) + sch = table.schema + # Numeric / boolean primitives. 
+ self.assertEqual(sch.field('b').type, pa.bool_()) + self.assertEqual(sch.field('by').type, pa.int8()) + self.assertEqual(sch.field('sh').type, pa.int16()) + self.assertEqual(sch.field('i').type, pa.int32()) + self.assertEqual(sch.field('lg').type, pa.int64()) + self.assertEqual(sch.field('fl').type, pa.float32()) + self.assertEqual(sch.field('db').type, pa.float64()) + # Temporal. + self.assertEqual( + sch.field('ts').type, + pa.timestamp('us', tz='UTC')) + self.assertEqual( + sch.field('ts_ns').type, + pa.timestamp('ns', tz='UTC')) + self.assertEqual( + sch.field('dt').type, + pa.timestamp('ms', tz='UTC')) + # Strings. + self.assertEqual( + sch.field('sym').type, + pa.dictionary(pa.uint32(), pa.utf8())) + self.assertEqual(sch.field('vc').type, pa.utf8()) + self.assertEqual(sch.field('st').type, pa.utf8()) + self.assertEqual(sch.field('ch').type, pa.uint16()) + # Fixed-size / extension. UUID is surfaced as pyarrow's + # registered Arrow `arrow.uuid` extension type (storage = + # FixedSizeBinary(16)). + uu_type = sch.field('uu').type + if isinstance(uu_type, pa.BaseExtensionType): + self.assertEqual(uu_type.extension_name, 'arrow.uuid') + self.assertEqual(uu_type.storage_type, pa.binary(16)) + else: + self.assertEqual(uu_type, pa.binary(16)) + self.assertEqual( + sch.field('l256').type, pa.binary(32)) + self.assertEqual(sch.field('ip').type, pa.uint32()) + # Geohash precision_bits=40 (8 chars × 5 bits) → int64. + self.assertEqual(sch.field('gh').type, pa.int64()) + + # Spot-check a few values. 
+ row = table.to_pylist()[0] + self.assertIs(row['b'], True) + self.assertEqual(row['by'], 7) + self.assertEqual(row['sh'], 700) + self.assertEqual(row['i'], 70000) + self.assertEqual(row['lg'], 7000000000) + self.assertAlmostEqual(row['fl'], 3.5) + self.assertAlmostEqual(row['db'], 6.5) + self.assertEqual(row['sym'], 'AAA') + self.assertEqual(row['vc'], 'varchar-value') + self.assertEqual(row['st'], 'string-value') + self.assertEqual(row['ch'], ord('C')) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_empty_result(self): + """A query that returns zero rows. The server still sends a + terminal frame; ``Client.query`` must not hang or crash. + Current behaviour: returns a DataFrame with zero columns (the + ``pa.table({})`` fallback). Pins that contract.""" + import pandas as pd + table_name = 't_egress_empty_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, x LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + with qi.Client.from_conf(self._conf()) as client: + pdf = client.query( + f'SELECT * FROM {table_name} WHERE x = 1' + ).to_pandas() + self.assertIsInstance(pdf, pd.DataFrame) + self.assertEqual(len(pdf), 0) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_polars_from_arrow_consumes_capsule(self): + """``Client.query`` exposes ``__arrow_c_stream__`` directly off + the Rust cursor, so polars can consume it without pyarrow being + the import-time mediator. 
Pins that contract: the polars frame + round-trips the rows and our lazy ``_PYARROW`` global stays + unset by the call.""" + try: + import polars as pl + except ImportError: + self.skipTest('polars not installed') + table_name = 't_egress_polars_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG, vc VARCHAR) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 42, 'hello'), " + f"('2024-01-02T00:00:00Z', 7, 'world')") + self.qdb_plain.retry_check_table(table_name, min_rows=2) + with qi.Client.from_conf(self._conf()) as client: + with client.query( + f'SELECT lg, vc FROM {table_name} ORDER BY lg DESC' + ) as result: + df = pl.from_arrow(result) + self.assertEqual(df.shape, (2, 2)) + self.assertEqual(df['lg'].to_list(), [42, 7]) + self.assertEqual(df['vc'].to_list(), ['hello', 'world']) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_to_polars_and_iter_polars_symbol_categorical(self): + """SYMBOL egresses as a polars ``Categorical`` (codes + dict via the + registry, no per-row remap), nulls preserved. 
``iter_polars`` over a + multi-batch result stitches via ``pl.concat`` to the same frame — + every batch shares one ``Categories`` identity.""" + try: + import polars as pl + except ImportError: + self.skipTest('polars not installed') + import numpy as np + n = 100000 + table_name = 't_egress_iterpolars_' + uuid.uuid4().hex[:8] + exp = [None if i % 11 == 0 else f'sym_{i % 100}' for i in range(n)] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, sym SYMBOL, v LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + df = pd.DataFrame({ + 'ts': pd.to_datetime(np.arange(n), unit='s', utc=True), + 'sym': pd.Series(exp, dtype='string[pyarrow]'), + 'v': np.arange(n, dtype=np.int64), + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe( + df, table_name=table_name, at='ts', symbols=['sym']) + self.qdb_plain.retry_check_table(table_name, min_rows=n) + sql = f'SELECT sym, v FROM {table_name} ORDER BY v' + with qi.Client.from_conf(self._conf()) as client: + full = client.query(sql).to_polars() + self.assertEqual(full.shape, (n, 2)) + self.assertIsInstance(full.schema['sym'], pl.Categorical) + self.assertGreater(full['sym'].null_count(), 0) + self.assertEqual(full['v'].to_list(), list(range(n))) + self.assertEqual(full['sym'].cast(pl.Utf8).to_list(), exp) + with qi.Client.from_conf(self._conf()) as client: + frames = list(client.query(sql).iter_polars()) + self.assertGreater(len(frames), 1) + stitched = pl.concat(frames, how='vertical') + self.assertIsInstance(stitched.schema['sym'], pl.Categorical) + self.assertEqual(stitched['v'].to_list(), list(range(n))) + self.assertEqual(stitched['sym'].cast(pl.Utf8).to_list(), exp) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def _make_table(self, table_name, rows): + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + values = ', '.join( + f"('2024-01-01T00:00:0{i % 10}.000000Z', 
{i})" + for i in range(rows)) + self._exec(f'INSERT INTO {table_name} VALUES {values}') + self.qdb_plain.retry_check_table(table_name, min_rows=rows) + + def test_query_result_single_use(self): + """A ``QueryResult`` is single-use: a second materialisation, or + any materialisation after ``close()``, raises ``InvalidApiCall``. + Also pins the ``__arrow_c_stream__`` ``requested_schema`` + rejection.""" + table_name = 't_egress_single_' + uuid.uuid4().hex[:8] + try: + self._make_table(table_name, 1) + sql = f'SELECT lg FROM {table_name}' + with qi.Client.from_conf(self._conf()) as client: + result = client.query(sql) + result.to_arrow() + with self.assertRaises(qi.IngressError) as cm: + result.to_arrow() + self.assertEqual( + cm.exception.code, qi.IngressErrorCode.InvalidApiCall) + + closed = client.query(sql) + closed.close() + with self.assertRaises(qi.IngressError): + closed.to_pandas() + + stream = client.query(sql) + with self.assertRaises(NotImplementedError): + stream.__arrow_c_stream__(requested_schema=object()) + stream.close() + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_cancel_is_safe_and_idempotent(self): + """``cancel()`` drives the FFI cancel on a live cursor without + raising, is idempotent, and is a no-op after ``close()``.""" + table_name = 't_egress_cancel_' + uuid.uuid4().hex[:8] + try: + self._make_table(table_name, 8) + sql = f'SELECT lg FROM {table_name}' + with qi.Client.from_conf(self._conf()) as client: + with client.query(sql) as result: + result.cancel() + result.cancel() + + closed = client.query(sql) + closed.close() + closed.cancel() + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_capsule_path_no_leak(self): + """Loop the native ``__arrow_c_stream__`` paths — full consume, + abandoned (un-consumed) capsule, and empty result — and assert no + ``QueryResult`` is leaked. 
Exercises the producer refcount dance + and the capsule destructor under repetition for leak detectors.""" + import gc + table_name = 't_egress_leak_' + uuid.uuid4().hex[:8] + empty_name = 't_egress_leak_empty_' + uuid.uuid4().hex[:8] + try: + self._make_table(table_name, 4) + self._exec( + f'CREATE TABLE {empty_name} ' + '(ts TIMESTAMP, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + sql = f'SELECT lg FROM {table_name}' + empty_sql = f'SELECT lg FROM {empty_name} WHERE lg = -1' + with qi.Client.from_conf(self._conf()) as client: + gc.collect() + before = sum( + 1 for o in gc.get_objects() + if type(o) is qi.QueryResult) + for _ in range(64): + client.query(sql).to_arrow() + abandoned = client.query(sql) + capsule = abandoned.__arrow_c_stream__() + del capsule + del abandoned + client.query(empty_sql).to_arrow() + gc.collect() + after = sum( + 1 for o in gc.get_objects() + if type(o) is qi.QueryResult) + self.assertEqual(after, before) + finally: + for name in (table_name, empty_name): + try: + self._exec(f'DROP TABLE IF EXISTS {name}') + except Exception: + pass + + def test_bad_sql_raises_ingress_error(self): + """Server-side parse error surfaces as an ``IngressError`` from + ``client.query`` with a usable message.""" + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.query( + 'SELECT * FROM nonexistent_table_xyz_abc_123' + ).to_arrow() + msg = str(cm.exception) + # Don't pin the exact message — just check the user gets + # something informative about the missing table. 
+ self.assertTrue( + 'nonexistent_table_xyz' in msg.lower() + or 'does not exist' in msg.lower() + or 'not found' in msg.lower() + or 'invalid' in msg.lower(), + f'expected error message to mention the missing table; ' + f'got {msg!r}') + + def test_dtype_backend_variants(self): + """Validate the three `to_pandas` mappings: default (numpy + primitives + new ``str`` dtype), ``pyarrow`` (ArrowDtype-backed), + and ``numpy_nullable`` (pandas extension types). + + QuestDB BYTE column → int8/Int8Dtype/ArrowDtype(int8); LONG → + int64/Int64Dtype/ArrowDtype(int64); VARCHAR → str/StringDtype/ + ArrowDtype(string). One iteration, three reads against the same + table. + """ + import pandas as pd + import pyarrow as pa + table_name = 't_egress_dtype_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG, vc VARCHAR) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 42, 'hello')") + self.qdb_plain.retry_check_table(table_name, min_rows=1) + + sql = f'SELECT lg, vc FROM {table_name}' + with qi.Client.from_conf(self._conf()) as client: + default = client.query(sql).to_pandas() + arrow_backed = client.query(sql).to_pandas( + dtype_backend='pyarrow') + nullable = client.query(sql).to_pandas( + dtype_backend='numpy_nullable') + + # Default: numpy int64, new pandas 3.0 str dtype. + self.assertEqual(default['lg'].dtype, np.int64) + self.assertTrue( + pd.api.types.is_string_dtype(default['vc'].dtype), + f'expected str dtype, got {default["vc"].dtype!r}') + + # pyarrow: ArrowDtype-wrapped. + self.assertIsInstance( + arrow_backed['lg'].dtype, pd.ArrowDtype) + self.assertEqual( + arrow_backed['lg'].dtype.pyarrow_dtype, pa.int64()) + self.assertIsInstance( + arrow_backed['vc'].dtype, pd.ArrowDtype) + self.assertEqual( + arrow_backed['vc'].dtype.pyarrow_dtype, pa.string()) + + # numpy_nullable: pandas extension dtypes for primitives. 
+ self.assertIsInstance(nullable['lg'].dtype, pd.Int64Dtype) + self.assertIsInstance(nullable['vc'].dtype, pd.StringDtype) + + # Mutual-exclusion + invalid-value rejection. + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(ValueError): + client.query(sql).to_pandas( + dtype_backend='pyarrow', types_mapper=lambda t: None) + with self.assertRaises(ValueError): + client.query(sql).to_pandas( + dtype_backend='not_a_thing') + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_symbol_column_to_pandas(self): + """SYMBOL egresses as dictionary(uint32, utf8); pandas rejects + unsigned dictionary indices, so to_pandas / iter_pandas must + recast the index to int32. Covers the three dtype_backend + variants plus the streaming iter_pandas path. + """ + import pandas as pd + import pyarrow as pa + table_name = 't_egress_symbol_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, sym SYMBOL, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 'aa', 1), " + f"('2024-01-01T00:00:01Z', 'bb', 2), " + f"('2024-01-01T00:00:02Z', 'aa', 3)") + self.qdb_plain.retry_check_table(table_name, min_rows=3) + + sql = f'SELECT sym, lg FROM {table_name} ORDER BY ts' + + # Wire format: SYMBOL arrives as a dictionary with an + # unsigned index — the input that breaks pandas conversion. + with qi.Client.from_conf(self._conf()) as client: + table = client.query(sql).to_arrow() + sym_type = table.schema.field('sym').type + self.assertTrue( + pa.types.is_dictionary(sym_type), + f'expected dictionary type for SYMBOL; got {sym_type}') + self.assertTrue( + pa.types.is_unsigned_integer(sym_type.index_type), + f'expected unsigned dict index; got {sym_type.index_type}') + + # default to_pandas: must not raise; SYMBOL -> Categorical. 
+ with qi.Client.from_conf(self._conf()) as client: + default = client.query(sql).to_pandas() + self.assertEqual(str(default['sym'].dtype), 'category') + self.assertEqual(list(default['sym']), ['aa', 'bb', 'aa']) + self.assertEqual(list(default['lg']), [1, 2, 3]) + + # pyarrow + numpy_nullable backends: also must not raise. + with qi.Client.from_conf(self._conf()) as client: + arrow_backed = client.query(sql).to_pandas( + dtype_backend='pyarrow') + self.assertEqual(list(arrow_backed['sym']), ['aa', 'bb', 'aa']) + with qi.Client.from_conf(self._conf()) as client: + nullable = client.query(sql).to_pandas( + dtype_backend='numpy_nullable') + self.assertEqual(list(nullable['sym']), ['aa', 'bb', 'aa']) + + # streaming iter_pandas exercises the same per-batch recast. + with qi.Client.from_conf(self._conf()) as client: + syms = [] + for chunk in client.query(sql).iter_pandas(): + syms.extend(chunk['sym'].tolist()) + self.assertEqual(syms, ['aa', 'bb', 'aa']) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_numpy_egress_round_trip(self): + """The native (default) ``to_pandas()`` output feeds straight back + into ``Client.dataframe`` and reproduces the same values for the + types that round-trip through the numpy path + (long/double/bool/varchar/symbol/timestamp). Also checks the + ``df.attrs['questdb']`` round-trip metadata is attached. 
+ """ + import numpy as np + src = 't_rt_src_' + uuid.uuid4().hex[:8] + dst = 't_rt_dst_' + uuid.uuid4().hex[:8] + cols = 'ts, lg, db, bl, vc, sym' + try: + self._exec( + f'CREATE TABLE {src} ' + '(ts TIMESTAMP, lg LONG, db DOUBLE, bl BOOLEAN, ' + 'vc VARCHAR, sym SYMBOL) TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {src} VALUES " + f"('2024-01-01T00:00:00Z', 1, 1.5, true, 'aa', 's1'), " + f"('2024-01-01T00:00:01Z', 2, 2.5, false, 'bb', 's2'), " + f"('2024-01-01T00:00:02Z', 3, 3.5, true, 'cc', 's1')") + self.qdb_plain.retry_check_table(src, min_rows=3) + + with qi.Client.from_conf(self._conf()) as client: + df = client.query( + f'SELECT {cols} FROM {src} ORDER BY ts').to_pandas() + + meta = df.attrs['questdb']['columns'] + self.assertEqual(meta['lg']['kind'], 'long') + self.assertEqual(meta['db']['kind'], 'double') + self.assertEqual(meta['sym']['kind'], 'symbol') + self.assertEqual(meta['vc']['kind'], 'varchar') + self.assertEqual(meta['ts']['kind'], 'timestamp') + self.assertEqual(df['lg'].dtype, np.int64) + self.assertEqual(str(df['sym'].dtype), 'category') + + self._exec( + f'CREATE TABLE {dst} ' + '(ts TIMESTAMP, lg LONG, db DOUBLE, bl BOOLEAN, ' + 'vc VARCHAR, sym SYMBOL) TIMESTAMP(ts) PARTITION BY DAY WAL') + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=dst, at='ts') + self.qdb_plain.retry_check_table(dst, min_rows=3) + + with qi.Client.from_conf(self._conf()) as client: + back = client.query( + f'SELECT {cols} FROM {dst} ORDER BY ts').to_pandas() + self.assertEqual(list(back['lg']), [1, 2, 3]) + self.assertEqual(list(back['db']), [1.5, 2.5, 3.5]) + self.assertEqual([bool(x) for x in back['bl']], [True, False, True]) + self.assertEqual(list(back['vc']), ['aa', 'bb', 'cc']) + self.assertEqual(list(back['sym']), ['s1', 's2', 's1']) + finally: + for t in (src, dst): + try: + self._exec(f'DROP TABLE IF EXISTS {t}') + except Exception: + pass + + def test_numpy_egress_hybrid_nulls(self): + """Default 
(hybrid) null handling: a nullable LONG with nulls + becomes pandas ``Int64`` (``pd.NA``, analysis-safe); a LONG without + nulls stays plain ``int64``; DOUBLE null -> ``float64`` NaN; VARCHAR + null -> ``object`` None. + """ + import pandas as pd + import numpy as np + table_name = 't_egress_hybrid_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG, lg2 LONG, db DOUBLE, vc VARCHAR) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 7, 10, 1.5, 'x'), " + f"('2024-01-01T00:00:01Z', NULL, 20, NULL, NULL)") + self.qdb_plain.retry_check_table(table_name, min_rows=2) + with qi.Client.from_conf(self._conf()) as client: + df = client.query( + f'SELECT lg, lg2, db, vc FROM {table_name} ORDER BY ts' + ).to_pandas() + # nullable LONG with a null -> Int64 (pd.NA) + self.assertEqual(str(df['lg'].dtype), 'Int64') + self.assertEqual(df['lg'].iloc[0], 7) + self.assertTrue(df['lg'].iloc[1] is pd.NA) + # LONG with no nulls -> plain int64 + self.assertEqual(df['lg2'].dtype, np.int64) + self.assertEqual(list(df['lg2']), [10, 20]) + # DOUBLE null -> float64 NaN; VARCHAR null -> object None + self.assertTrue(pd.api.types.is_float_dtype(df['db'].dtype)) + self.assertTrue(pd.isna(df['db'].iloc[1])) + self.assertEqual(df['vc'].iloc[0], 'x') + self.assertTrue(pd.isna(df['vc'].iloc[1])) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_numpy_egress_nullable_round_trip(self): + """A nullable LONG round-trips through the default hybrid output: + query -> to_pandas (Int64 with pd.NA) -> Client.dataframe (normalised + to object + validity) -> query reproduces the value and the null. 
+ """ + import pandas as pd + src = 't_rtn_src_' + uuid.uuid4().hex[:8] + dst = 't_rtn_dst_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {src} (ts TIMESTAMP, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {src} VALUES " + f"('2024-01-01T00:00:00Z', 7), " + f"('2024-01-01T00:00:01Z', NULL), " + f"('2024-01-01T00:00:02Z', 9)") + self.qdb_plain.retry_check_table(src, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + df = client.query( + f'SELECT ts, lg FROM {src} ORDER BY ts').to_pandas() + self.assertEqual(str(df['lg'].dtype), 'Int64') + self._exec( + f'CREATE TABLE {dst} (ts TIMESTAMP, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=dst, at='ts') + self.qdb_plain.retry_check_table(dst, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + back = client.query( + f'SELECT lg FROM {dst} ORDER BY ts').to_pandas() + self.assertEqual(back['lg'].iloc[0], 7) + self.assertTrue(back['lg'].iloc[1] is pd.NA) + self.assertEqual(back['lg'].iloc[2], 9) + finally: + for t in (src, dst): + try: + self._exec(f'DROP TABLE IF EXISTS {t}') + except Exception: + pass + + def test_numpy_egress_round_trip_overrides(self): + """ipv4 / char / geohash round-trip through the native numpy path + driven by df.attrs metadata (no pyarrow). The destination column + types are verified by re-querying and checking the egress metadata + reports the same kinds. 
+ """ + src = 't_rto_src_' + uuid.uuid4().hex[:8] + dst = 't_rto_dst_' + uuid.uuid4().hex[:8] + cols = 'ts, ip, gh, c' + try: + self._exec( + f'CREATE TABLE {src} ' + '(ts TIMESTAMP, ip IPV4, gh GEOHASH(4c), c CHAR) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {src} VALUES " + f"('2024-01-01T00:00:00Z', '1.2.3.4', #u33d, 'A'), " + f"('2024-01-01T00:00:01Z', '255.0.0.1', #u33e, 'B')") + self.qdb_plain.retry_check_table(src, min_rows=2) + + with qi.Client.from_conf(self._conf()) as client: + df = client.query( + f'SELECT {cols} FROM {src} ORDER BY ts').to_pandas() + meta = df.attrs['questdb']['columns'] + self.assertEqual(meta['ip']['kind'], 'ipv4') + self.assertEqual(meta['c']['kind'], 'char') + self.assertEqual(meta['gh']['kind'], 'geohash') + self.assertEqual(meta['gh']['precision_bits'], 20) + + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=dst, at='ts') + self.qdb_plain.retry_check_table(dst, min_rows=2) + + with qi.Client.from_conf(self._conf()) as client: + back = client.query( + f'SELECT ip, gh, c FROM {dst}').to_pandas() + bmeta = back.attrs['questdb']['columns'] + self.assertEqual(bmeta['ip']['kind'], 'ipv4') + self.assertEqual(bmeta['c']['kind'], 'char') + self.assertEqual(bmeta['gh']['kind'], 'geohash') + self.assertEqual(bmeta['gh']['precision_bits'], 20) + finally: + for t in (src, dst): + try: + self._exec(f'DROP TABLE IF EXISTS {t}') + except Exception: + pass + + def test_null_round_trip_per_dtype_backend(self): + """Pin the null contract across the three dtype_backend variants. + + The QuestDB QWP egress wire carries an explicit validity bitmap + (questdb-rs/src/egress/decoder.rs::ColumnBuffer.validity), so + Arrow consumers see real nulls — not sentinel masquerade. This + test inserts SQL NULL values and verifies what each mapper + surfaces: + + - default (native hybrid): a nullable LONG with nulls becomes + pandas Int64 (pd.NA); DOUBLE null is NaN; VARCHAR null is None. 
+ - dtype_backend="pyarrow": ArrowDtype preserves null as pd.NA. + - dtype_backend="numpy_nullable": Int64Dtype/Float64Dtype/ + StringDtype preserve null as pd.NA. + + Also verifies that QuestDB's storage sentinel-collision + contract holds in the other direction: a real INT64_MIN + ingested as a value comes back as null. + """ + import pandas as pd + import pyarrow as pa + import numpy as np + table_name = 't_egress_nulls_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG, db DOUBLE, vc VARCHAR) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + # Row 0: all values populated. + # Row 1: all nullable columns NULL. + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 42, 3.5, 'hello'), " + f"('2024-01-01T00:00:01Z', NULL, NULL, NULL)") + self.qdb_plain.retry_check_table(table_name, min_rows=2) + + sql = f'SELECT lg, db, vc FROM {table_name} ORDER BY ts' + + # 1. Arrow level: verify the validity bitmap arrived. + with qi.Client.from_conf(self._conf()) as client: + table = client.query(sql).to_arrow() + lg_col = table.column('lg') + self.assertEqual(lg_col.null_count, 1, + f'expected 1 null on row 1; got {lg_col.null_count}') + self.assertFalse(lg_col.is_null()[0].as_py()) + self.assertTrue(lg_col.is_null()[1].as_py()) + self.assertEqual(table.column('db').null_count, 1) + self.assertEqual(table.column('vc').null_count, 1) + + # 2. default to_pandas — native hybrid: a nullable LONG with + # nulls becomes Int64 (pd.NA); DOUBLE null is NaN; VARCHAR + # null is None. 
+ with qi.Client.from_conf(self._conf()) as client: + default = client.query(sql).to_pandas() + self.assertEqual(str(default['lg'].dtype), 'Int64') + self.assertEqual(default['lg'].iloc[0], 42) + self.assertTrue(default['lg'].iloc[1] is pd.NA) + self.assertTrue(pd.api.types.is_float_dtype(default['db'].dtype)) + self.assertEqual(default['db'].iloc[0], 3.5) + self.assertTrue(pd.isna(default['db'].iloc[1])) + self.assertEqual(default['vc'].iloc[0], 'hello') + self.assertTrue(pd.isna(default['vc'].iloc[1])) + + # 3. pyarrow-backed to_pandas — pd.NA preserved. + with qi.Client.from_conf(self._conf()) as client: + arrow_backed = client.query(sql).to_pandas( + dtype_backend='pyarrow') + self.assertIsInstance(arrow_backed['lg'].dtype, pd.ArrowDtype) + self.assertEqual(arrow_backed['lg'].iloc[0], 42) + self.assertTrue(arrow_backed['lg'].iloc[1] is pd.NA) + self.assertTrue(arrow_backed['db'].iloc[1] is pd.NA) + self.assertTrue(arrow_backed['vc'].iloc[1] is pd.NA) + + # 4. numpy_nullable to_pandas — pd.NA preserved via + # Int64Dtype / Float64Dtype / StringDtype. + with qi.Client.from_conf(self._conf()) as client: + nullable = client.query(sql).to_pandas( + dtype_backend='numpy_nullable') + self.assertIsInstance(nullable['lg'].dtype, pd.Int64Dtype) + self.assertEqual(nullable['lg'].iloc[0], 42) + self.assertTrue(nullable['lg'].iloc[1] is pd.NA) + self.assertIsInstance(nullable['db'].dtype, pd.Float64Dtype) + self.assertTrue(nullable['db'].iloc[1] is pd.NA) + self.assertIsInstance(nullable['vc'].dtype, pd.StringDtype) + self.assertTrue(nullable['vc'].iloc[1] is pd.NA) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_sentinel_collision_is_documented_lossy(self): + """Verify QuestDB's storage-level sentinel-collision contract: + a user-supplied INT64_MIN value ingested as a LONG is folded + into NULL by the server. 
This is documented QuestDB behaviour (see + plan-egress-to-pandas.md "Unavoidable lossy scenarios"); we + pin it here so that a future server-side change is flagged. + """ + import pandas as pd + import numpy as np + table_name = 't_egress_sentinel_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, lg LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + # Ingest via Client.dataframe: the Python int range + # accepts INT64_MIN cleanly, sidestepping the SQL + # parser ambiguity around the literal. + # Also exercise tz-aware ingest (was rejected by columnar v1 + # until commit 9db3325 follow-up). Use the trailing 'Z' form + # that pd.to_datetime infers as DatetimeTZDtype. + df = pd.DataFrame({ + 'ts': pd.to_datetime([ + '2024-01-01T00:00:00Z', + '2024-01-01T00:00:01Z']), + 'lg': np.array( + [42, np.iinfo(np.int64).min], dtype=np.int64), + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table_name, at='ts') + self.qdb_plain.retry_check_table(table_name, min_rows=2) + + sql = f'SELECT lg FROM {table_name} ORDER BY ts' + with qi.Client.from_conf(self._conf()) as client: + table = client.query(sql).to_arrow() + + # The INT64_MIN row collapses to NULL server-side. + self.assertEqual( + table.column('lg').null_count, 1, + 'expected the INT64_MIN row to be folded into NULL ' + 'by QuestDB storage; a null_count of exactly 1 ' + 'pins that contract') + self.assertFalse(table.column('lg').is_null()[0].as_py()) + self.assertTrue(table.column('lg').is_null()[1].as_py()) + self.assertEqual(table.column('lg')[0].as_py(), 42) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + def test_sequential_queries_on_one_client(self): + """Open one Client, run several queries in sequence. Catches + regressions in any per-call reader/cursor lifecycle assumption. + Pool-reuse assertions live in ``TestEgressPool`` so this test + stays focused on the per-query result shape. 
+ """ + table_name = 't_egress_seq_' + uuid.uuid4().hex[:8] + try: + self._exec( + f'CREATE TABLE {table_name} ' + '(ts TIMESTAMP, x LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + self._exec( + f"INSERT INTO {table_name} VALUES " + f"('2024-01-01T00:00:00Z', 1), " + f"('2024-01-01T00:00:01Z', 2), " + f"('2024-01-01T00:00:02Z', 3)") + self.qdb_plain.retry_check_table(table_name, min_rows=3) + + with qi.Client.from_conf(self._conf()) as client: + first = client.query( + f'SELECT count() FROM {table_name}').to_arrow() + self.assertEqual(first.num_rows, 1) + self.assertEqual(first.column(0).to_pylist(), [3]) + + second = client.query( + f'SELECT x FROM {table_name} ORDER BY x').to_arrow() + self.assertEqual(second.num_rows, 3) + self.assertEqual( + second.column('x').to_pylist(), [1, 2, 3]) + + third = client.query( + f'SELECT x FROM {table_name} WHERE x > 1 ' + f'ORDER BY x').to_arrow() + self.assertEqual(third.num_rows, 2) + self.assertEqual( + third.column('x').to_pylist(), [2, 3]) + finally: + try: + self._exec(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + +class TestEgressPool(unittest.TestCase): + """Structural tests for the ``questdb_db`` egress reader pool. + + Asserts behaviours the per-feature tests in + ``TestEgressWithDatabase`` exercise the code path of but don't + individually pin. Concurrency tests use ``threading.Barrier`` + + fixed iteration counts so they're deterministic — no ``sleep`` + or wall-clock dependencies. All tests run whenever the system- + test fixture is available (``QDB_REPO_PATH`` set); no separate + stress-mode gate. 
+ """ + + @classmethod + def setUpClass(cls): + TestWithDatabase.setUpClass.__func__(cls) + + @classmethod + def tearDownClass(cls): + TestWithDatabase.tearDownClass.__func__(cls) + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests require QuestDB 9.4.3+') + + def setUp(self): + self._require_qwp_ws() + + def _conf(self, **extra): + conf = (f'qwpws::addr={self.qdb_plain.host}:' + f'{self.qdb_plain.http_server_port};') + for k, v in extra.items(): + conf += f'{k}={v};' + return conf + + def _seed_table(self, n_rows=3): + """Create a small table and return its name. The pool tests + below all just need *something* queryable; one shared shape + keeps them simple.""" + table = 't_egress_pool_' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, x LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + # Use one-second steps but stay within a single hour by + # rolling seconds over into minutes; this keeps SQL literals + # trivially valid for n_rows up to 60*60. + values = ','.join( + f"('2024-01-01T00:{i // 60:02d}:{i % 60:02d}Z', {i})" + for i in range(n_rows)) + self.qdb_plain.http_sql_query( + f'INSERT INTO {table} VALUES {values}') + self.qdb_plain.retry_check_table(table, min_rows=n_rows) + self.addCleanup( + lambda: self._drop_quietly(table)) + return table + + def _drop_quietly(self, table): + try: + self.qdb_plain.http_sql_query(f'DROP TABLE IF EXISTS {table}') + except Exception: + pass + + # ------------------------------------------------------------------ + # Pool reuse — the architecture's primary promise + # ------------------------------------------------------------------ + + def test_idle_grows_on_sequential_use(self): + """After N sequential queries on one Client the pool holds + exactly one idle reader. (The lifted-out pool-reuse assertion + previously in test_sequential_queries_on_one_client.) 
+ """ + table = self._seed_table(n_rows=3) + with qi.Client.from_conf(self._conf()) as client: + for _ in range(5): + client.query(f'SELECT count() FROM {table}').to_arrow() + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual(in_use, 0) + self.assertEqual( + idle, 1, + f'expected 1 idle reader cached across 5 queries; ' + f'got in_use={in_use}, idle={idle}') + + # ------------------------------------------------------------------ + # Arc lifeline — silent UAF if it regresses + # ------------------------------------------------------------------ + + def test_query_after_client_close_via_held_iterator(self): + """The architecture promises that ``Client.close()`` can free + the user-facing handle while a still-streaming cursor exists. + The ``Arc`` inside ``line_reader.ownership.Pooled`` + is what keeps the pool's transport alive across that window. + + We exercise it directly: open a client, start consuming a + query lazily, close the client mid-stream, then drain the + rest. A regression that replaced the Arc with a raw pointer + would surface as a use-after-free here. + """ + table = self._seed_table(n_rows=64) + client = qi.Client.from_conf(self._conf()) + try: + result = client.query(f'SELECT x FROM {table} ORDER BY x') + it = result.iter_arrow() + first = next(it) + client.close() + rest = list(it) + total_rows = first.num_rows + sum(b.num_rows for b in rest) + self.assertEqual(total_rows, 64) + finally: + client.close() + + # ------------------------------------------------------------------ + # must_close — silent corruption if a broken reader gets recycled + # ------------------------------------------------------------------ + + def test_must_close_drops_broken_reader_from_pool(self): + """Abandoning a cursor mid-stream causes the Rust + ``Cursor::Drop`` to close the transport (because + ``cursor_active`` is still true at drop time). 
The Python + ``_ReaderHandle`` defaults to ``_must_close=True``, so on + dealloc the reader is dropped — not recycled — and the next + borrower gets a fresh handshake instead of a broken pipe. + """ + import gc + table = self._seed_table(n_rows=64) + with qi.Client.from_conf(self._conf()) as client: + # Seed the pool with a fully-drained reader so idle==1. + client.query(f'SELECT count() FROM {table}').to_arrow() + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual((in_use, idle), (0, 1)) + + # Abandon a cursor mid-stream. The generator's `finally` + # frees the cursor, but `cursor_active` was still true at + # the time of free — so the Rust transport was torn down. + # The reader handle must NOT be returned to the idle list. + result = client.query(f'SELECT x FROM {table} ORDER BY x') + it = result.iter_arrow() + next(it) + del it + del result + gc.collect() + + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual( + in_use, 0, + f'leaked in-use after abandon; got ' + f'in_use={in_use}, idle={idle}') + self.assertEqual( + idle, 0, + f'broken reader was recycled instead of dropped; got ' + f'in_use={in_use}, idle={idle}. A subsequent query ' + f'would have hit a broken pipe.') + + # Next query must succeed against a fresh reader. + result = client.query( + f'SELECT count() FROM {table}').to_arrow() + self.assertEqual(result.column(0).to_pylist(), [64]) + # Pool re-grew by one for the fresh borrow. + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual((in_use, idle), (0, 1)) + + # ------------------------------------------------------------------ + # pool_max — the InvalidApiCall("pool exhausted") error path + # ------------------------------------------------------------------ + + def test_pool_max_exhausted_raises_not_hangs(self): + """When the pool is at ``pool_max`` and a second borrow is + attempted, the Rust side returns + ``InvalidApiCall("Reader pool exhausted")``. 
Verify it + surfaces as an ``IngressError``, not a hang or generic + socket error.""" + table = self._seed_table(n_rows=64) + conf = self._conf(pool_size='1', pool_max='1') + with qi.Client.from_conf(conf) as client: + # Hold one reader by starting an iterator and not + # exhausting it. + held_result = client.query( + f'SELECT x FROM {table} ORDER BY x') + held_it = held_result.iter_arrow() + next(held_it) + try: + in_use, _ = qi._debug_egress_pool_stats(client) + self.assertEqual( + in_use, 1, + f'expected 1 in-use reader for the held cursor; ' + f'got in_use={in_use}') + + # Second borrow must error, not block. + with self.assertRaises(qi.IngressError) as cm: + client.query( + f'SELECT count() FROM {table}').to_arrow() + msg = str(cm.exception).lower() + self.assertTrue( + 'exhausted' in msg or 'pool' in msg, + f'expected pool-exhaustion message; got ' + f'{cm.exception!r}') + finally: + # Drain the held iterator so the pool is releaseable. + list(held_it) + + # ------------------------------------------------------------------ + # Conf-string acceptance — BLOCKER 1 of the thermo-nuclear review + # ------------------------------------------------------------------ + + def test_pool_conf_keys_accepted_by_reader(self): + """The reader's conf parser was extended to accept ``qwpws::`` + / ``qwpwss::`` schemes and ignore ``pool_*`` keys. Verify + that a pool-configured Client produces a working egress + reader (a regression in the accept list would surface as a + ConfigError on the first ``query()``). 
+ """ + table = self._seed_table(n_rows=3) + conf = self._conf( + pool_size='2', + pool_max='4', + pool_idle_timeout_ms='30000', + pool_reap='manual') + with qi.Client.from_conf(conf) as client: + r = client.query(f'SELECT count() FROM {table}').to_arrow() + self.assertEqual(r.column(0).to_pylist(), [3]) + + # ------------------------------------------------------------------ + # Concurrency — Barrier-synced, no sleep, deterministic + # ------------------------------------------------------------------ + + def test_concurrent_queries_share_pool(self): + """N threads × M queries on one Client with ``pool_size=K``. + Asserts: no exceptions; pool grew at most to ``K``; all + readers returned (``in_use==0`` at end); pool stays under + ``pool_max``. + """ + import threading + table = self._seed_table(n_rows=3) + conf = self._conf(pool_size='4', pool_max='8') + n_threads = 8 + per_thread = 25 + sql = f'SELECT count() FROM {table}' + + errors = [] + ready = threading.Barrier(n_threads) + + def worker(client): + try: + ready.wait(timeout=30) + for _ in range(per_thread): + client.query(sql).to_arrow() + except BaseException as e: + errors.append(repr(e)) + + with qi.Client.from_conf(conf) as client: + threads = [ + threading.Thread(target=worker, args=(client,)) + for _ in range(n_threads)] + for t in threads: + t.start() + for t in threads: + t.join(timeout=60) + + self.assertEqual( + errors, [], + f'{len(errors)}/{n_threads} workers errored: ' + f'{errors[:3]}') + for t in threads: + self.assertFalse(t.is_alive(), 'worker thread hung') + + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual( + in_use, 0, + f'workers returned but in_use={in_use}, ' + f'idle={idle}') + self.assertGreaterEqual(idle, 1) + self.assertLessEqual( + idle, 8, + f'idle={idle} exceeds pool_max=8 — auto-grow ' + f'overshot or returns leaked readers') + + def test_long_running_stream_does_not_starve_other_queries(self): + """Thread A holds a streaming cursor across a Barrier (one + 
batch pulled, one pending). Thread B runs M short queries on + the same Client. The pool must auto-grow to a second reader + for B; B must not wait for A. Pure correctness assertion; + no timing comparison. + """ + import threading + table = self._seed_table(n_rows=64) + conf = self._conf(pool_size='2', pool_max='4') + + a_progress = threading.Event() + b_done = threading.Event() + errors = [] + b_query_count = 16 + + with qi.Client.from_conf(conf) as client: + def slow_a(): + try: + result = client.query( + f'SELECT x FROM {table} ORDER BY x') + it = result.iter_arrow() + next(it) + a_progress.set() + # Wait for B to finish before draining the rest. + self.assertTrue(b_done.wait(timeout=60)) + list(it) # drain + except BaseException as e: + errors.append(('A', repr(e))) + + def fast_b(): + try: + self.assertTrue(a_progress.wait(timeout=30)) + for _ in range(b_query_count): + client.query( + f'SELECT count() FROM {table}').to_arrow() + b_done.set() + except BaseException as e: + errors.append(('B', repr(e))) + b_done.set() + + ta = threading.Thread(target=slow_a) + tb = threading.Thread(target=fast_b) + ta.start() + tb.start() + ta.join(timeout=60) + tb.join(timeout=60) + + self.assertEqual( + errors, [], + f'thread errored: {errors[:3]}') + self.assertFalse(ta.is_alive(), 'thread A hung') + self.assertFalse(tb.is_alive(), 'thread B hung') + + in_use, idle = qi._debug_egress_pool_stats(client) + self.assertEqual(in_use, 0) + # Pool must have grown to at least 2 (A held one, B + # borrowed at least one more). + self.assertGreaterEqual( + idle, 1, + f'pool did not retain any idle reader; ' + f'in_use={in_use}, idle={idle}') + + +class TestColumnIngressNarrowTypes(unittest.TestCase): + """End-to-end tests for the narrow Arrow primitive types added to + ``Client.dataframe`` column ingress: ``pa.int8/16/32`` → + BYTE/SHORT/INT, unsigned Arrow integers, ``pa.float16/32``, + and Arrow timestamp units through the QWP/WebSocket classifier. 
+ + The contract: client-side dispatch is a pure function of the + Arrow input dtype (no content sniffing, no schema hints), and + target-column coercion (e.g. BYTE landing in a LONG column) is + handled server-side. Each happy-path test asserts the + round-trip identity through a fresh table; the coercion tests + pre-create the target column with a wider type and verify the + server narrows / widens correctly. + """ + + @classmethod + def setUpClass(cls): + TestWithDatabase.setUpClass.__func__(cls) + + @classmethod + def tearDownClass(cls): + TestWithDatabase.tearDownClass.__func__(cls) + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests require QuestDB 9.4.3+') + + def setUp(self): + self._require_qwp_ws() + + def _conf(self): + return (f'qwpws::addr={self.qdb_plain.host}:' + f'{self.qdb_plain.http_server_port};') + + def _table(self, prefix='t_narrow_'): + name = prefix + uuid.uuid4().hex[:8] + self.addCleanup(lambda: self._drop_quietly(name)) + return name + + def _drop_quietly(self, table): + try: + self.qdb_plain.http_sql_query( + f'DROP TABLE IF EXISTS {table}') + except Exception: + pass + + def _create_table(self, table, value_col_sql): + """Pre-create the table with an explicit ``ts TIMESTAMP`` + designated column plus one value column. Pre-create rather + than rely on auto-infer so the timestamp column is + guaranteed to be named ``ts`` (auto-create renames to + ``timestamp``) and so the server pins the value column type + for the coercion / round-trip tests.""" + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + f'(ts TIMESTAMP, {value_col_sql}) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + + def _make_df_with_ts(self, value_col_name, value_arr, n): + """Build a DataFrame with a designated-timestamp column and + a single value column. 
Keeps the per-test setup terse.""" + import pyarrow as pa + ts = pa.array( + [1700000000_000000 + i * 1_000_000 for i in range(n)], + type=pa.timestamp('us', tz='UTC')) + return pd.DataFrame({ + 'ts': pd.array(ts, dtype=pd.ArrowDtype(ts.type)), + value_col_name: pd.array( + value_arr, dtype=pd.ArrowDtype(value_arr.type)), + }) + + def _arrow_series(self, values, arrow_type): + import pyarrow as pa + arr = pa.array(values, type=arrow_type) + return pd.array(arr, dtype=pd.ArrowDtype(arr.type)) + + def _assert_table_empty(self, table): + with qi.Client.from_conf(self._conf()) as client: + got = client.query(f'SELECT count() FROM {table}').to_arrow() + self.assertEqual(got.column(0).to_pylist(), [0]) + + # ---------- happy-path round-trips ---------- + + def test_int8_round_trip(self): + """pa.int8 → BYTE wire → server stores as BYTE → egress + emits pa.int8. QuestDB BYTE is non-nullable; we stay inside + the value range [-127, 127] to avoid any sentinel ambiguity. + """ + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v BYTE') + values = pa.array([-127, -1, 0, 1, 127], type=pa.int8()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int8()) + self.assertEqual( + got.column('v').to_pylist(), [-127, -1, 0, 1, 127]) + + def test_int16_round_trip(self): + """pa.int16 → SHORT wire. 
SHORT is non-nullable; stay + inside [-32767, 32767] to avoid sentinel ambiguity.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v SHORT') + values = pa.array( + [-32767, -1, 0, 1, 32767], type=pa.int16()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int16()) + self.assertEqual( + got.column('v').to_pylist(), + [-32767, -1, 0, 1, 32767]) + + def test_int32_round_trip(self): + """pa.int32 → INT wire. QuestDB INT uses INT32_MIN as the + null sentinel; we avoid it here and pin the sentinel + collision contract separately in + ``test_int32_min_collapses_to_null``. + """ + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v INT') + values = pa.array( + [-2147483647, -1, 0, 1, 2147483647], type=pa.int32()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int32()) + self.assertEqual( + got.column('v').to_pylist(), + [-2147483647, -1, 0, 1, 2147483647]) + + def test_float32_round_trip(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v FLOAT') + values = pa.array( + [-1.5, 0.0, 0.5, 1.0, 3.14], type=pa.float32()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + 
self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.float32()) + self.assertEqual( + got.column('v').to_pylist(), + [-1.5, 0.0, 0.5, 1.0, 3.140000104904175]) + + def test_arrow_wide_numeric_sources_round_trip(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, arrow_l LONG, nullable_l LONG, ' + 'arrow_d DOUBLE, nullable_d DOUBLE) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + ts = pa.array( + [1700000000_000000, 1700000001_000000, 1700000002_000000], + type=pa.timestamp('us', tz='UTC')) + df = pd.DataFrame({ + 'ts': pd.array(ts, dtype=pd.ArrowDtype(ts.type)), + 'arrow_l': pd.Series( + pa.array([1, None, -3], type=pa.int64()), + dtype=pd.ArrowDtype(pa.int64())), + 'nullable_l': pd.Series( + [4, pd.NA, -6], dtype=pd.Int64Dtype()), + 'arrow_d': pd.Series( + pa.array([1.5, None, -3.25], type=pa.float64()), + dtype=pd.ArrowDtype(pa.float64())), + 'nullable_d': pd.Series( + [4.5, pd.NA, -6.25], dtype=pd.Float64Dtype()), + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT arrow_l, nullable_l, arrow_d, nullable_d ' + f'FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('arrow_l').type, pa.int64()) + self.assertEqual(got.column('nullable_l').type, pa.int64()) + self.assertEqual(got.column('arrow_d').type, pa.float64()) + self.assertEqual(got.column('nullable_d').type, pa.float64()) + self.assertEqual(got.column('arrow_l').to_pylist(), [1, None, -3]) + self.assertEqual(got.column('nullable_l').to_pylist(), [4, None, -6]) + self.assertEqual( + got.column('arrow_d').to_pylist(), [1.5, None, -3.25]) + 
self.assertEqual( + got.column('nullable_d').to_pylist(), [4.5, None, -6.25]) + + def test_large_utf8_round_trip(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v VARCHAR') + values = pa.array( + ['alpha', None, 'gamma', 'delta', 'epsilon'], + type=pa.large_string()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').to_pylist(), values.to_pylist()) + + # ---------- null handling ---------- + + def test_short_is_non_nullable_nulls_become_zero(self): + """QuestDB SHORT is non-nullable: Arrow nulls written to a + SHORT column come back as 0, not preserved. This is a + QuestDB storage contract (no sentinel value for SHORT in + the existing schema), not a client-side bug. Pinned so a + future server-side fix (e.g., adding a SHORT null + sentinel) is flagged.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v SHORT') + values = pa.array( + [-100, None, 0, None, 200], type=pa.int16()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int16()) + # Nulls flatten to 0; non-null values round-trip cleanly. 
+ self.assertEqual( + got.column('v').to_pylist(), + [-100, 0, 0, 0, 200]) + self.assertEqual( + got.column('v').null_count, 0, + 'SHORT is non-nullable; nulls should be erased server-side') + + def test_int32_min_collapses_to_null(self): + """QuestDB INT uses INT32_MIN as the null sentinel — a + legitimate user value of INT32_MIN gets folded into NULL + on read. Same lossy contract as INT64_MIN → LONG NULL + pinned in ``test_sentinel_collision_is_documented_lossy``; + repeated here for INT so a regression on either type is + caught.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v INT') + # INT32_MIN at index 0, ordinary value at index 1. + values = pa.array( + [-2147483648, 42], type=pa.int32()) + df = self._make_df_with_ts('v', values, 2) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=2) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int32()) + self.assertEqual( + got.column('v').null_count, 1, + 'expected the INT32_MIN row to be folded into NULL ' + 'by QuestDB INT storage') + self.assertTrue(got.column('v').is_null()[0].as_py()) + self.assertFalse(got.column('v').is_null()[1].as_py()) + self.assertEqual(got.column('v')[1].as_py(), 42) + + # ---------- server-side coercion ---------- + + def test_int8_into_existing_long_column_widens_server_side(self): + """Pre-create a LONG column and write ``pa.int8`` into it. + The server widens to LONG on insert (the policy-2 contract: + target-column coercion is the server's job).""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + # Pre-create the table with v as LONG, not BYTE. 
+ self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, v LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + values = pa.array([1, 2, 3, 4, 5], type=pa.int8()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int64()) + self.assertEqual( + got.column('v').to_pylist(), [1, 2, 3, 4, 5]) + + def test_float32_into_existing_double_column_widens(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, v DOUBLE) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + values = pa.array([0.5, 1.5, 2.5], type=pa.float32()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.float64()) + self.assertEqual( + got.column('v').to_pylist(), [0.5, 1.5, 2.5]) + + # ---------- unhappy paths ---------- + + # ---------- UUID (Category C — canonical mirror + extension type) ---------- + + @staticmethod + def _uuid_to_wire(u): + """Convert a Python ``uuid.UUID`` to QuestDB's UUID wire + layout (the C header: "bytes 0..8 lo half LE, + bytes 8..16 hi half LE"). 
``uuid.UUID.bytes`` is big-endian + per RFC 4122; the wire layout is two 64-bit LE halves with + ``lo`` first.""" + b = u.bytes + return bytes(reversed(b[8:16])) + bytes(reversed(b[0:8])) + + @staticmethod + def _extract_uuid_storage(col): + """Return the FSB(16) storage bytes from an egress UUID + column, whether or not pyarrow has the `arrow.uuid` + extension type registered.""" + import pyarrow as pa + if isinstance(col.type, pa.BaseExtensionType): + return col.combine_chunks().storage.to_pylist() + return col.to_pylist() + + def test_uuid_round_trip_via_fsb16(self): + """``pa.fixed_size_binary(16)`` → UUID wire → server stores + as UUID → egress emits the same FSB(16) storage bytes. + Canonical mirror path: no extension type wrapping. Round-trip + is byte-identity at the Arrow wire level (the + `_uuid_to_wire` helper converts the user-facing UUID to that + layout up front).""" + import pyarrow as pa + import uuid as uuid_mod + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v UUID') + uuids = [uuid_mod.uuid4() for _ in range(5)] + wire_bytes = [self._uuid_to_wire(u) for u in uuids] + values = pa.array(wire_bytes, type=pa.binary(16)) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(self._extract_uuid_storage(got.column('v')), + wire_bytes) + + def test_uuid_round_trip_via_arrow_uuid_extension(self): + """If pyarrow has registered the `arrow.uuid` extension + type, ingress accepts it directly: we unwrap to the FSB(16) + storage type and dispatch identically to the canonical + mirror path.""" + import pyarrow as pa + import uuid as uuid_mod + self._require_qwp_ws() + try: + uuid_type = pa.uuid() + except (AttributeError, TypeError): + 
self.skipTest( + 'pyarrow.uuid() not available in this pyarrow build') + table = self._table() + self._create_table(table, 'v UUID') + uuids = [uuid_mod.uuid4() for _ in range(3)] + wire_bytes = [self._uuid_to_wire(u) for u in uuids] + values = pa.ExtensionArray.from_storage( + uuid_type, + pa.array(wire_bytes, type=pa.binary(16))) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(self._extract_uuid_storage(got.column('v')), + wire_bytes) + + def test_uuid_with_nulls_round_trip(self): + """UUID validity bitmap round-trips: nulls stay null.""" + import pyarrow as pa + import uuid as uuid_mod + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v UUID') + w0 = self._uuid_to_wire(uuid_mod.uuid4()) + w2 = self._uuid_to_wire(uuid_mod.uuid4()) + w4 = self._uuid_to_wire(uuid_mod.uuid4()) + values = pa.array( + [w0, None, w2, None, w4], type=pa.binary(16)) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + col = got.column('v') + self.assertEqual(self._extract_uuid_storage(col), + [w0, None, w2, None, w4]) + self.assertEqual(col.null_count, 2) + + def test_uuid_string_into_uuid_column_via_server_coercion(self): + """Strict-mirror policy: `pa.string()` always maps to + VARCHAR on the wire. 
When the target column is UUID, + QuestDB's server-side INSERT coercion narrows the VARCHAR + string into a UUID — the canonical "policy-2" contract + from the design doc.""" + import pyarrow as pa + import uuid as uuid_mod + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v UUID') + uuids = [uuid_mod.uuid4() for _ in range(3)] + values = pa.array([str(u) for u in uuids], type=pa.string()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + # Server-side coercion lands the value as a UUID; egress + # emits the FSB(16) storage in the same wire layout as + # the canonical mirror path. + expected = [self._uuid_to_wire(u) for u in uuids] + self.assertEqual(self._extract_uuid_storage(got.column('v')), + expected) + + def test_invalid_uuid_string_is_rejected_by_server(self): + """Bad UUID strings written into a UUID target surface as + an IngressError (server rejection), not a silent corruption + or a connection poisoning. Verification item #4 from the + design doc.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v UUID') + values = pa.array( + ['not-a-uuid', 'also-not'], type=pa.string()) + df = self._make_df_with_ts('v', values, 2) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError): + client.dataframe(df, table_name=table, at='ts') + + def test_fsb16_rejected_by_row_ilp(self): + """Row-ILP (`Sender.dataframe`) genuinely does not support + UUID. `_FIELD_TARGETS_ROW` doesn't include + `col_target_column_uuid`, so the resolver fails to map + `fsb16_arrow` to any target. 
This pins that + protocol-asymmetry contract.""" + import pyarrow as pa + import uuid as uuid_mod + self._require_qwp_ws() + values = pa.array( + [uuid_mod.uuid4().bytes for _ in range(2)], + type=pa.binary(16)) + df = self._make_df_with_ts('v', values, 2) + conf = ( + f'tcp::addr={self.qdb_plain.host}:' + f'{self.qdb_plain.line_tcp_port};') + with qi.Sender.from_conf(conf) as sender: + with self.assertRaises(qi.IngressError): + sender.dataframe(df, table_name='dummy', at='ts') + + def test_fsb_other_size_rejected(self): + """``FixedSizeBinary(k)`` for k != 16 is not UUID and has no + QuestDB analogue — should be rejected cleanly rather than + silently routed somewhere wrong.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + values = pa.array( + [b'\x00' * 8, b'\xff' * 8], type=pa.binary(8)) + df = self._make_df_with_ts('v', values, 2) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError): + client.dataframe(df, table_name=table, at='ts') + + # ---------- UInt32 / IPV4 policy ---------- + + def test_pa_uint32_round_trip_as_long(self): + """Plain ``pa.uint32()`` widens to LONG on Client.dataframe. + + The Rust Arrow ingestion path reserves IPV4 for UInt32 fields + with ``questdb.column_type=ipv4`` metadata. Pandas drops Arrow + field metadata before it reaches this planner, so this path + follows the plain-UInt32 rule. 
+ """ + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + ints = [1, 2, 3, 0, 4294967295] + values = pa.array(ints, type=pa.uint32()) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int64()) + self.assertEqual(got.column('v').to_pylist(), ints) + + def test_pa_uint64_within_i64_range_round_trips_as_long(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + ints = [0, 2 ** 63 - 1, 42] + values = pa.array(ints, type=pa.uint64()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int64()) + self.assertEqual(got.column('v').to_pylist(), ints) + + def test_arrow_classifier_numeric_mix_round_trips_on_real_server(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + ts_type = pa.timestamp('ms', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array( + [1704067200000, 1704067201000, 1704067202000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'u8': pd.Series( + pa.array([1, 2, None], type=pa.uint8()), + dtype=pd.ArrowDtype(pa.uint8())), + 'u16': pd.Series( + pa.array([1000, None, 3000], type=pa.uint16()), + dtype=pd.ArrowDtype(pa.uint16())), + 'u32': pd.Series( + pa.array([1, 2 ** 31, 2 ** 32 - 1], type=pa.uint32()), + dtype=pd.ArrowDtype(pa.uint32())), + 'u64': pd.Series( + 
pa.array([1, 2 ** 63 - 1, None], type=pa.uint64()), + dtype=pd.ArrowDtype(pa.uint64())), + 'f16': pd.Series( + pa.array(np.array([1.5, 2.5, 3.5], dtype=np.float16), + type=pa.float16()), + dtype=pd.ArrowDtype(pa.float16())), + }) + + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT u8, u16, u32, u64, f16 FROM {table} ' + 'ORDER BY timestamp' + ).to_arrow() + + self.assertEqual(got.column('u8').type, pa.int32()) + self.assertEqual(got.column('u16').type, pa.int32()) + self.assertEqual(got.column('u32').type, pa.int64()) + self.assertEqual(got.column('u64').type, pa.int64()) + self.assertEqual(got.column('f16').type, pa.float32()) + self.assertEqual(got.column('u8').to_pylist(), [1, 2, None]) + self.assertEqual(got.column('u16').to_pylist(), [1000, None, 3000]) + self.assertEqual(got.column('u32').to_pylist(), [1, 2 ** 31, 2 ** 32 - 1]) + self.assertEqual(got.column('u64').to_pylist(), [1, 2 ** 63 - 1, None]) + self.assertEqual(got.column('f16').to_pylist(), [1.5, 2.5, 3.5]) + + def test_pa_uint64_above_i64_max_rejected_before_publish(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + values = pa.array([0, 2 ** 63], type=pa.uint64()) + df = self._make_df_with_ts('v', values, 2) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaisesRegex( + qi.IngressError, + r'UInt64 value 9223372036854775808 .* does not fit QuestDB LONG'): + client.dataframe(df, table_name=table, at='ts') + self._assert_table_empty(table) + + # ---------- TIMESTAMP validation policy ---------- + + def test_arrow_designated_timestamp_null_rejected_before_publish(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + ts_type = pa.timestamp('us', tz='UTC') + df = 
pd.DataFrame({ + 'ts': self._arrow_series( + [1700000000_000000, None], ts_type), + 'v': self._arrow_series([1, 2], pa.int64()), + }) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.dataframe(df, table_name=table, at='ts') + self.assertIn('null', str(cm.exception).lower()) + self._assert_table_empty(table) + + def test_arrow_designated_timestamp_negative_rejected_before_publish(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + ts_type = pa.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': self._arrow_series([-1, 1700000000_000000], ts_type), + 'v': self._arrow_series([1, 2], pa.int64()), + }) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.dataframe(df, table_name=table, at='ts') + self.assertIn('unix epoch', str(cm.exception).lower()) + self._assert_table_empty(table) + + def test_arrow_designated_timestamp_ms_s_units_widen_to_micros(self): + """ArrowDtype ``timestamp('ms')`` / ``timestamp('s')`` + designated-``at`` columns are widened to microseconds in Rust by + the millis/seconds designated-timestamp FFI (no client-side + cast). A mixed frame (numpy value column) forces the manual + columnar planner, the path that routes these units to the new + FFI. The sub-second 'ms' value proves the scale is applied rather + than the raw value copied straight onto the micros wire.""" + import pyarrow as pa + import numpy as np + self._require_qwp_ws() + # 2023-06-15T12:34:56(.789) UTC, expressed in each unit. 
+ for unit, raw, scale in (('ms', 1686832496789, 1000), + ('s', 1686832496, 1_000_000)): + table = self._table() + self._create_table(table, 'v LONG') + df = pd.DataFrame({ + 'ts': self._arrow_series( + [raw], pa.timestamp(unit, tz='UTC')), + 'v': pd.Series([1], dtype=np.int64), # numpy -> manual + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=1) + with qi.Client.from_conf(self._conf()) as client: + got = client.query(f'SELECT ts FROM {table}').to_arrow() + self.assertEqual( + got.column('ts').type, pa.timestamp('us', tz='UTC')) + got_us = got.column('ts').cast(pa.int64()).to_pylist()[0] + self.assertEqual(got_us, raw * scale) + + def test_arrow_designated_timestamp_ms_null_rejected_before_publish(self): + """The plan validator's designated-timestamp null guard covers + the widened ms/s sources too, not just us/ns. A numpy value + column keeps the frame on the manual planner where the guard + runs.""" + import pyarrow as pa + import numpy as np + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG') + df = pd.DataFrame({ + 'ts': self._arrow_series( + [1686832496789, None], pa.timestamp('ms', tz='UTC')), + 'v': pd.Series([1, 2], dtype=np.int64), # numpy -> manual + }) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises( + qi.UnsupportedDataFrameShapeError) as cm: + client.dataframe(df, table_name=table, at='ts') + reasons = ' '.join( + f['reason'] for f in cm.exception.column_failures).lower() + self.assertIn('null', reasons) + self._assert_table_empty(table) + + def test_arrow_timestamp_field_null_rejected_before_publish(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'event_ts TIMESTAMP, v LONG') + ts_type = pa.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': self._arrow_series( + [1700000000_000000, 1700000001_000000], ts_type), + 
'event_ts': self._arrow_series( + [1700000002_000000, None], ts_type), + 'v': self._arrow_series([1, 2], pa.int64()), + }) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.dataframe(df, table_name=table, at='ts') + self.assertIn('event_ts', str(cm.exception)) + self.assertIn('null', str(cm.exception).lower()) + self._assert_table_empty(table) + + def test_arrow_timestamp_field_negative_rejected_before_publish(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'event_ts TIMESTAMP, v LONG') + ts_type = pa.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': self._arrow_series( + [1700000000_000000, 1700000001_000000], ts_type), + 'event_ts': self._arrow_series( + [-1, 1700000002_000000], ts_type), + 'v': self._arrow_series([1, 2], pa.int64()), + }) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.dataframe(df, table_name=table, at='ts') + self.assertIn('event_ts', str(cm.exception)) + self.assertIn('unix epoch', str(cm.exception).lower()) + self._assert_table_empty(table) + + def test_arrow_multi_chunk_buffer_reuse_boundary_rows(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, seq LONG, price DOUBLE) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + rows = 64_001 + ts_values = 1_700_000_000_000_000 + np.arange(rows, dtype=np.int64) + seq_values = np.arange(rows, dtype=np.int64) + df = pd.DataFrame({ + 'ts': self._arrow_series( + ts_values, + pa.timestamp('us', tz='UTC')), + 'seq': self._arrow_series(seq_values, pa.int64()), + 'price': self._arrow_series( + seq_values.astype(np.float64) * 0.25, + pa.float64()), + }) + with qi.Client.from_conf(self._conf()) as client: + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + client.dataframe( + df, table_name=table, at='ts', 
+ max_rows_per_batch=32000) + finally: + io_stats = qi._debug_dataframe_columnar_io_stats( + enabled=False) + self.assertEqual(io_stats['flush_calls'], 3) + self.assertEqual(io_stats['sync_calls'], 1) + + self.qdb_plain.retry_check_table(table, min_rows=rows) + with qi.Client.from_conf(self._conf()) as client: + count = client.query( + f'SELECT count() FROM {table}').to_arrow() + expected_seq = [0, 31999, 32000, 32001, 63999, 64000] + got = client.query( + f'SELECT seq, price FROM {table} ' + f'WHERE seq IN ({", ".join(str(v) for v in expected_seq)}) ' + f'ORDER BY seq').to_arrow() + self.assertEqual(count.column(0).to_pylist(), [rows]) + self.assertEqual(got.column('seq').to_pylist(), expected_seq) + self.assertEqual( + got.column('price').to_pylist(), + [value * 0.25 for value in expected_seq]) + + def test_arrow_explicit_symbol_list_auto_creates_symbol_column(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + df = pd.DataFrame({ + 'ts': self._arrow_series( + [1700000000_000000, 1700000001_000000, 1700000002_000000], + pa.timestamp('us', tz='UTC')), + 'region': self._arrow_series( + ['us-east', 'us-west', 'us-east'], pa.string()), + 'note': self._arrow_series( + ['alpha', 'beta', 'gamma'], pa.string()), + 'seq': pd.Series([1, 2, 3], dtype='int64'), + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe( + df, + table_name=table, + at='ts', + symbols=['region']) + + resp = self.qdb_plain.retry_check_table(table, min_rows=3) + col_types = {c['name']: c['type'] for c in resp['columns']} + self.assertEqual(col_types['region'], 'SYMBOL') + self.assertEqual(col_types['note'], 'VARCHAR') + self.assertEqual(col_types['seq'], 'LONG') + scrubbed = [row[:-1] for row in resp['dataset']] + self.assertEqual( + scrubbed, + [['us-east', 'alpha', 1], + ['us-west', 'beta', 2], + ['us-east', 'gamma', 3]]) + + def test_ipv4_string_coercion_is_unsupported(self): + """Unlike UUID (where the server parses VARCHAR strings + into UUIDs), 
QuestDB does NOT currently support VARCHAR → + IPV4 coercion at insert time. Writing `pa.string()` IP + addresses into an IPV4 column surfaces a server rejection + ("type coercion from VARCHAR to IPv4 is not supported"). + Pin this contract — if a future QuestDB release adds the + coercion, this test flips and the IPV4 path joins UUID's + string-coercion ergonomics.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v IPV4') + ips = ['192.168.1.10', '10.0.0.1', '127.0.0.1'] + values = pa.array(ips, type=pa.string()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError) as cm: + client.dataframe(df, table_name=table, at='ts') + self.assertIn('ipv4', str(cm.exception).lower()) + + def test_invalid_ipv4_string_is_rejected_by_server(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v IPV4') + values = pa.array( + ['not-an-ip', '999.999.999.999'], type=pa.string()) + df = self._make_df_with_ts('v', values, 2) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError): + client.dataframe(df, table_name=table, at='ts') + + def test_pa_uint32_is_routed_to_long_not_ipv4(self): + """Plain ``pa.uint32()`` is not enough to select IPV4.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + # Pre-create with an IPV4 column. The server will reject LONG + # wire values landing in IPV4 with a schema mismatch. 
+ self._create_table(table, 'v IPV4') + values = pa.array([1, 2, 3], type=pa.uint32()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + with self.assertRaises(qi.IngressError): + client.dataframe(df, table_name=table, at='ts') + + # ---------- LONG256 (Category C — FixedSizeBinary(32)) ---------- + + def test_long256_round_trip(self): + """``pa.fixed_size_binary(32)`` → LONG256 wire → server + stores as LONG256 → egress emits FSB(32). Bytes are + forwarded verbatim — same opaque-bytes convention as UUID + (matches Polars / Rust-direct: see PR #150).""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG256') + # Use distinct 32-byte patterns. The QuestDB wire format + # for LONG256 is 4 LE 64-bit limbs, least-significant first. + v0 = bytes(range(32)) + v1 = bytes([i ^ 0xFF for i in range(32)]) + v2 = bytes([0] * 32) + values = pa.array([v0, v1, v2], type=pa.binary(32)) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + col = got.column('v') + if isinstance(col.type, pa.BaseExtensionType): + got_bytes = col.combine_chunks().storage.to_pylist() + else: + got_bytes = col.to_pylist() + # v2 (all zeros) is the LONG256 null sentinel — server reads + # it back as NULL. Document this with the assertion. + self.assertEqual(got_bytes[0], v0) + self.assertEqual(got_bytes[1], v1) + # Index 2 may be None (null sentinel) — pin that contract. 
+ self.assertIn(got_bytes[2], (v2, None)) + + def test_long256_with_nulls_round_trip(self): + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v LONG256') + v0 = bytes(range(32)) + v2 = bytes(range(32, 64)) + v4 = bytes([0xAB] * 32) + values = pa.array( + [v0, None, v2, None, v4], type=pa.binary(32)) + df = self._make_df_with_ts('v', values, 5) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=5) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + col = got.column('v') + if isinstance(col.type, pa.BaseExtensionType): + got_bytes = col.combine_chunks().storage.to_pylist() + else: + got_bytes = col.to_pylist() + self.assertEqual(got_bytes, [v0, None, v2, None, v4]) + + def test_fsb32_rejected_by_row_ilp(self): + """Row-ILP doesn't list `col_target_column_long256` in + `_FIELD_TARGETS_ROW`, so `Sender.dataframe` rejects FSB(32) + with `BadDataFrame`. 
Symmetric to the UUID FSB(16) row-ILP + rejection test in PR 2.""" + import pyarrow as pa + self._require_qwp_ws() + values = pa.array( + [bytes(range(32)), bytes(range(32, 64))], + type=pa.binary(32)) + df = self._make_df_with_ts('v', values, 2) + conf = ( + f'tcp::addr={self.qdb_plain.host}:' + f'{self.qdb_plain.line_tcp_port};') + with qi.Sender.from_conf(conf) as sender: + with self.assertRaises(qi.IngressError): + sender.dataframe(df, table_name='dummy', at='ts') + + def test_pa_uint8_round_trip_as_short(self): + """Plain ``pa.uint8()`` widens to SHORT on Client.dataframe.""" + import pyarrow as pa + self._require_qwp_ws() + table = self._table() + self._create_table(table, 'v SHORT') + values = pa.array([0, 1, 255], type=pa.uint8()) + df = self._make_df_with_ts('v', values, 3) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=3) + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + self.assertEqual(got.column('v').type, pa.int16()) + self.assertEqual(got.column('v').to_pylist(), [0, 1, 255]) + + +class TestColumnIngressFailover(unittest.TestCase): + """Within-call failover for ``Client.dataframe`` (the column path). + + Connect-time / between-operation failover is automatic in Rust (the + next pool borrow auto-selects the live primary); these tests pin the + two cases the Python wrapper is responsible for: a dead+live endpoint + list (the borrow must skip the dead endpoint and land on the live + primary) and a mid-stream server bounce (the transient + ``FailoverRetry`` re-sends the whole df). Both routes — pandas/numpy + and the Arrow capsule (polars / pyarrow) — are covered. 
+ """ + + @classmethod + def setUpClass(cls): + TestWithDatabase.setUpClass.__func__(cls) + + @classmethod + def tearDownClass(cls): + TestWithDatabase.tearDownClass.__func__(cls) + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests require QuestDB 9.4.3+') + + def setUp(self): + self._require_qwp_ws() + + @staticmethod + def _unused_tcp_port(): + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock: + sock.bind(('127.0.0.1', 0)) + return sock.getsockname()[1] + + def _conf(self, endpoints=None, **extra): + if endpoints is None: + endpoints = [ + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + addr = ','.join(f'{h}:{p}' for h, p in endpoints) + conf = f'qwpws::addr={addr};' + for k, v in extra.items(): + conf += f'{k}={v};' + return conf + + def _sfa_conf(self, sender_id, sf_dir, endpoints=None, **extra): + sfa_extra = { + 'sender_id': sender_id, + 'sf_dir': sf_dir, + 'pool_size': '1', + 'pool_max': '1', + 'pool_reap': 'manual', + 'reconnect_max_duration_millis': '30000', + 'close_flush_timeout_millis': '30000', + } + sfa_extra.update(extra) + return self._conf(endpoints=endpoints, **sfa_extra) + + @staticmethod + def _sfa_file_count(sf_dir, sender_id): + slot_dir = pathlib.Path(sf_dir) / sender_id + if not slot_dir.exists(): + return 0 + return sum(1 for path in slot_dir.iterdir() + if path.name.endswith('.sfa')) + + def _table(self, prefix='t_fo_'): + name = prefix + uuid.uuid4().hex[:8] + self.addCleanup(lambda: self._drop_quietly(name)) + return name + + def _drop_quietly(self, table): + try: + self.qdb_plain.http_sql_query(f'DROP TABLE IF EXISTS {table}') + except Exception: + pass + + def _create_table(self, table): + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} (ts TIMESTAMP, v LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL ' + 'DEDUP UPSERT KEYS(ts, v)') + + def _pandas_df(self, n): + ts = [1700000000_000000 + i * 1_000_000 for i in range(n)] + 
return pd.DataFrame({ + 'ts': pd.to_datetime(ts, unit='us'), + 'v': np.arange(n, dtype=np.int64), + }) + + def _arrow_df(self, n): + import pyarrow as pa + ts = pa.array( + [1700000000_000000 + i * 1_000_000 for i in range(n)], + type=pa.timestamp('us', tz='UTC')) + return pd.DataFrame({ + 'ts': pd.array(ts, dtype=pd.ArrowDtype(ts.type)), + 'v': pd.array( + pa.array(list(range(n)), type=pa.int64()), + dtype=pd.ArrowDtype(pa.int64())), + }) + + def _read_back_v(self, table): + with qi.Client.from_conf(self._conf()) as client: + got = client.query( + f'SELECT v FROM {table} ORDER BY ts').to_arrow() + return got.column('v').to_pylist() + + def test_sfa_dataframe_numpy_round_trip(self): + """SFA through the parent Python ``Client.dataframe`` NumPy path.""" + table = self._table('t_sfa_df_np_') + sender_id = 'py-df-np-' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, v LONG, sym SYMBOL) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL ' + 'DEDUP UPSERT KEYS(ts, v)') + df = pd.DataFrame({ + 'ts': pd.to_datetime([ + 1_700_000_000_000_000, + 1_700_000_000_001_000, + 1_700_000_000_002_000, + ], unit='us'), + 'v': np.array([0, 1, 2], dtype=np.int64), + 'sym': pd.Categorical(['alpha', 'bravo', 'alpha']), + }) + + with tempfile.TemporaryDirectory(prefix='py-df-sfa-np-') as sf_dir: + with qi.Client.from_conf( + self._sfa_conf(sender_id, sf_dir)) as client: + client.dataframe( + df, table_name=table, at='ts', symbols=['sym']) + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f'SELECT v, sym FROM {table} ORDER BY v') + self.assertEqual( + resp['dataset'], + [[0, 'alpha'], [1, 'bravo'], [2, 'alpha']]) + + def test_sfa_dataframe_arrow_round_trip(self): + """SFA through the parent Python ``Client.dataframe`` Arrow path.""" + if pyarrow is None: + self.skipTest('pyarrow not installed') + + table = self._table('t_sfa_df_arrow_') + 
sender_id = 'py-df-arrow-' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, v LONG, sym SYMBOL) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL ' + 'DEDUP UPSERT KEYS(ts, v)') + + ts_type = pyarrow.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pyarrow.array([ + 1_700_000_000_000_000, + 1_700_000_000_001_000, + 1_700_000_000_002_000, + ], type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'v': pd.Series( + pyarrow.array([10, 11, 12], type=pyarrow.int64()), + dtype=pd.ArrowDtype(pyarrow.int64())), + 'sym': pd.Series( + pyarrow.array(['xray', 'yankee', 'xray'], + type=pyarrow.string()), + dtype=pd.ArrowDtype(pyarrow.string())), + }) + + with tempfile.TemporaryDirectory(prefix='py-df-sfa-arrow-') as sf_dir: + with qi.Client.from_conf( + self._sfa_conf(sender_id, sf_dir)) as client: + client.dataframe( + df, + table_name=table, + at='ts', + schema_overrides={'sym': 'symbol'}) + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table, min_rows=3) + resp = self.qdb_plain.http_sql_query( + f'SELECT v, sym FROM {table} ORDER BY v') + self.assertEqual( + resp['dataset'], + [[10, 'xray'], [11, 'yankee'], [12, 'xray']]) + + def test_sfa_dataframe_rejection_reports_once_and_continues(self): + table = self._table('t_sfa_df_reject_') + sender_id = 'py-df-reject-' + uuid.uuid4().hex[:8] + self.qdb_plain.http_sql_query( + f'CREATE TABLE {table} ' + '(ts TIMESTAMP, v LONG, bad LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + + valid1 = pd.DataFrame({ + 'ts': pd.to_datetime([1_700_000_000_000_000], unit='us'), + 'v': np.array([0], dtype=np.int64), + }) + rejected = pd.DataFrame({ + 'ts': pd.to_datetime([1_700_000_000_001_000], unit='us'), + 'bad': pd.Series(['not-a-long'], dtype=object), + }) + valid2 = pd.DataFrame({ + 'ts': pd.to_datetime([1_700_000_000_002_000], unit='us'), + 'v': np.array([2], dtype=np.int64), + }) + + with 
tempfile.TemporaryDirectory(prefix='py-df-sfa-reject-') as sf_dir: + with qi.Client.from_conf( + self._sfa_conf(sender_id, sf_dir)) as client: + client.dataframe(valid1, table_name=table, at='ts') + with self.assertRaises( + qi.IngressServerRejectionError) as raised: + client.dataframe(rejected, table_name=table, at='ts') + diagnostic = raised.exception.qwp_ws_error + self.assertIsNotNone(diagnostic) + self.assertEqual( + diagnostic.category, + qi.QwpWsErrorCategory.SchemaMismatch) + self.assertEqual( + diagnostic.applied_policy, + qi.QwpWsErrorPolicy.DropAndContinue) + self.assertEqual(diagnostic.status, 0x03) + self.assertEqual(diagnostic.from_fsn, 1) + self.assertEqual(diagnostic.to_fsn, 1) + + client.dataframe(valid2, table_name=table, at='ts') + + self.assertEqual(self._sfa_file_count(sf_dir, sender_id), 0) + + self.qdb_plain.retry_check_table(table, min_rows=2) + resp = self.qdb_plain.http_sql_query( + f'SELECT v FROM {table} ORDER BY v') + self.assertEqual(resp['dataset'], [[0], [2]]) + + def test_dead_then_live_endpoint_numpy_route(self): + """A dead first endpoint + the live primary: the pool borrow + rotates past the dead endpoint, the whole df lands. 
NumPy + (pandas) route.""" + table = self._table() + self._create_table(table) + endpoints = [ + (self.qdb_plain.host, self._unused_tcp_port()), + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + conf = self._conf( + endpoints=endpoints, + reconnect_max_duration_millis='30000') + df = self._pandas_df(2000) + with qi.Client.from_conf(conf) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=2000) + self.assertEqual(self._read_back_v(table), list(range(2000))) + + def test_dead_then_live_endpoint_arrow_route(self): + """Same, via the Arrow capsule (pyarrow-backed) route.""" + table = self._table() + self._create_table(table) + endpoints = [ + (self.qdb_plain.host, self._unused_tcp_port()), + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + conf = self._conf( + endpoints=endpoints, + reconnect_max_duration_millis='30000') + df = self._arrow_df(2000) + with qi.Client.from_conf(conf) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=2000) + self.assertEqual(self._read_back_v(table), list(range(2000))) + + def test_polars_dataframe_round_trip(self): + """``pl.DataFrame`` (the Arrow capsule route) lands every row; + pins that the polars source feeds the same whole-df path.""" + try: + import polars as pl + except ImportError: + self.skipTest('polars not installed') + table = self._table() + self._create_table(table) + df = pl.DataFrame({ + 'ts': [ + datetime.datetime(2023, 11, 14, 22, 13, 20, + tzinfo=datetime.timezone.utc) + + datetime.timedelta(seconds=i) + for i in range(1500)], + 'v': list(range(1500)), + }) + with qi.Client.from_conf(self._conf()) as client: + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=1500) + self.assertEqual(self._read_back_v(table), list(range(1500))) + + def test_mid_stream_bounce_resends_whole_df_numpy(self): + """Bounce the server mid-call: the 
transient ``FailoverRetry`` + re-sends the whole df on a fresh conn. DEDUP collapses any + duplicate prefix; the final row set is exact (no loss / no dup). + NumPy route.""" + table = self._table() + self._create_table(table) + df = self._pandas_df(20000) + with qi.Client.from_conf( + self._conf(reconnect_max_duration_millis='60000')) as client: + # Warm the pool so a live conn is idle, then bounce: the next + # borrow hands back that now-stale conn, the flush hits a dead + # socket -> FailoverRetry -> whole-df re-send on a reconnected + # primary. DEDUP collapses any duplicate prefix. + client.dataframe(self._pandas_df(2), table_name=table, at='ts') + self.qdb_plain.stop() + self.qdb_plain.start() + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=20000) + self.assertEqual(self._read_back_v(table), list(range(20000))) + + def test_mid_stream_bounce_resends_whole_df_arrow(self): + """Same bounce, Arrow capsule route.""" + table = self._table() + self._create_table(table) + df = self._arrow_df(20000) + with qi.Client.from_conf( + self._conf(reconnect_max_duration_millis='60000')) as client: + client.dataframe(self._arrow_df(2), table_name=table, at='ts') + self.qdb_plain.stop() + self.qdb_plain.start() + client.dataframe(df, table_name=table, at='ts') + self.qdb_plain.retry_check_table(table, min_rows=20000) + self.assertEqual(self._read_back_v(table), list(range(20000))) + + +class TestEgressFailover(unittest.TestCase): + """Egress read failover: materialise-whole = transparent (the reset + callback discards the partial accumulation and replays from + batch-0); streaming = explicit ``FailoverWouldDuplicate``.""" + + @classmethod + def setUpClass(cls): + TestWithDatabase.setUpClass.__func__(cls) + + @classmethod + def tearDownClass(cls): + TestWithDatabase.tearDownClass.__func__(cls) + + def _require_qwp_ws(self): + if self.qdb_plain.version < FIRST_QWP_WS_RELEASE: + self.skipTest( + 'QWP/WebSocket integration tests 
require QuestDB 9.4.3+') + + def setUp(self): + self._require_qwp_ws() + + @staticmethod + def _unused_tcp_port(): + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock: + sock.bind(('127.0.0.1', 0)) + return sock.getsockname()[1] + + def _conf(self, endpoints=None, **extra): + if endpoints is None: + endpoints = [ + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + addr = ','.join(f'{host}:{port}' for host, port in endpoints) + conf = f'qwpws::addr={addr};' + for k, v in extra.items(): + conf += f'{k}={v};' + return conf + + def _exec(self, sql): + return self.qdb_plain.http_sql_query(sql) + + def _drop_quietly(self, table): + try: + self._exec(f'DROP TABLE IF EXISTS {table}') + except Exception: + pass + + def _seed(self, n_rows): + """A multi-batch result: enough rows that QuestDB streams more + than one record batch, so a mid-stream bounce lands after the + first batch is delivered.""" + table = 't_egress_fo_' + uuid.uuid4().hex[:8] + self.addCleanup(lambda: self._drop_quietly(table)) + self._exec( + f'CREATE TABLE {table} (ts TIMESTAMP, v LONG) ' + 'TIMESTAMP(ts) PARTITION BY DAY WAL') + base = '2024-01-01T00:00:00.000000Z' + # Bulk-insert via a generator series keeps the SQL compact. + # long_sequence(n) yields x = 1..n; v = x - 1 gives 0..n-1. 
+ self._exec( + f"INSERT INTO {table} " + f"SELECT timestamp_sequence('{base}', 1000) AS ts, x - 1 AS v " + f"FROM long_sequence({n_rows})") + self.qdb_plain.retry_check_table(table, min_rows=n_rows) + return table + + def test_materialise_whole_transparent_across_bounce(self): + """``to_arrow`` / ``to_pandas`` / ``to_polars`` complete with the + full, in-order result even when the server bounces mid-stream: + the installed reset callback discards the partial accumulation + and the query replays from batch-0.""" + n = 200000 + table = self._seed(n) + expected = list(range(n)) + + with qi.Client.from_conf( + self._conf(failover_max_duration_ms='60000')) as client: + result = client.query(f'SELECT v FROM {table} ORDER BY ts') + # Bounce before the (single-use) materialisation drains the + # stream: a mid-query failover re-executes and the reset + # discards anything already buffered. + self.qdb_plain.stop() + self.qdb_plain.start() + table_out = result.to_arrow() + self.assertEqual(table_out.column('v').to_pylist(), expected) + + def test_to_pandas_numpy_transparent_across_bounce(self): + """Default ``to_pandas`` (the numpy accumulator we own) is + likewise transparent across a bounce.""" + n = 200000 + table = self._seed(n) + expected = list(range(n)) + with qi.Client.from_conf( + self._conf(failover_max_duration_ms='60000')) as client: + result = client.query(f'SELECT v FROM {table} ORDER BY ts') + self.qdb_plain.stop() + self.qdb_plain.start() + df = result.to_pandas() + self.assertEqual(df['v'].tolist(), expected) + + def test_to_polars_transparent_across_bounce(self): + try: + import polars # noqa: F401 + except ImportError: + self.skipTest('polars not installed') + n = 200000 + table = self._seed(n) + expected = list(range(n)) + with qi.Client.from_conf( + self._conf(failover_max_duration_ms='60000')) as client: + result = client.query(f'SELECT v FROM {table} ORDER BY ts') + self.qdb_plain.stop() + self.qdb_plain.start() + df = result.to_polars() + 
self.assertEqual(df['v'].to_list(), expected) + + def test_to_polars_dead_then_live_endpoint(self): + """Polars materialisation uses the same reader failover walk as + the other egress adapters: a dead first endpoint is skipped and + the live standalone server satisfies ``target=primary``.""" + try: + import polars # noqa: F401 + except ImportError: + self.skipTest('polars not installed') + n = 2048 + table = self._seed(n) + endpoints = [ + (self.qdb_plain.host, self._unused_tcp_port()), + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + with qi.Client.from_conf( + self._conf(endpoints=endpoints, + target='primary', + failover_max_duration_ms='60000')) as client: + df = client.query(f'SELECT v FROM {table} ORDER BY ts').to_polars() + self.assertEqual(df['v'].to_list(), list(range(n))) + + def test_polars_from_arrow_dead_then_live_endpoint(self): + """The pyarrow-free Polars capsule path also borrows through the + same multi-endpoint reader pool before Polars starts consuming + the Arrow stream.""" + try: + import polars as pl + except ImportError: + self.skipTest('polars not installed') + n = 2048 + table = self._seed(n) + endpoints = [ + (self.qdb_plain.host, self._unused_tcp_port()), + (self.qdb_plain.host, self.qdb_plain.http_server_port)] + with qi.Client.from_conf( + self._conf(endpoints=endpoints, + target='primary', + failover_max_duration_ms='60000')) as client: + with client.query(f'SELECT v FROM {table} ORDER BY ts') as result: + df = pl.from_arrow(result) + self.assertEqual(df['v'].to_list(), list(range(n))) + + def test_iter_arrow_surfaces_failover_would_duplicate(self): + """Streaming ``iter_arrow`` installs no reset: a mid-stream + failover after the first batch is delivered surfaces a clean, + catchable ``FailoverWouldDuplicate`` rather than silently + re-reading.""" + # Use a generated result large enough that the server is still + # producing after the first small batch. 
A pre-seeded table can + # finish and buffer before the graceful fixture bounce breaks the + # WebSocket, making the test depend on timing. + n = 100000000 + with qi.Client.from_conf( + self._conf(failover_max_duration_ms='60000', + max_batch_rows='1024')) as client: + it = client.query( + f'SELECT x - 1 AS v FROM long_sequence({n})').iter_arrow() + first = next(it) + self.assertGreater(first.num_rows, 0) + # First batch delivered; bounce so the next pull fails over. + self.qdb_plain.stop() + self.qdb_plain.start() + with self.assertRaises(qi.IngressError) as cm: + for _ in it: + pass + self.assertEqual( + cm.exception.code, + qi.IngressErrorCode.FailoverWouldDuplicate) + + def test_iter_pandas_surfaces_failover_would_duplicate(self): + """Same contract for the numpy streaming ``iter_pandas``.""" + n = 100000000 + with qi.Client.from_conf( + self._conf(failover_max_duration_ms='60000', + max_batch_rows='1024')) as client: + it = client.query( + f'SELECT x - 1 AS v FROM long_sequence({n})').iter_pandas() + first = next(it) + self.assertGreater(len(first), 0) + self.qdb_plain.stop() + self.qdb_plain.start() + with self.assertRaises(qi.IngressError) as cm: + for _ in it: + pass + self.assertEqual( + cm.exception.code, + qi.IngressErrorCode.FailoverWouldDuplicate) + + +class _FakeStatusServer: + """Port of ``QwpQueryClientMultiHostFailoverTest.FakeStatusServer``: a + raw loopback socket that answers every probe with a fixed HTTP status + (and optional ``X-QuestDB-Role`` header) and counts how many times it + was connected to. 
A real QuestDB always advertises a single role, so + role-negotiation failover can only be exercised against an in-process + fake that can pretend to be a REPLICA / return 401.""" + + def __init__(self, status_code, role_header=None): + self.status_code = status_code + self.role_header = role_header + self.connections = 0 + self._lock = threading.Lock() + self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) + self._sock.bind(('127.0.0.1', 0)) + self._sock.listen(50) + self._running = True + + @property + def port(self): + return self._sock.getsockname()[1] + + def start(self): + threading.Thread(target=self._loop, daemon=True).start() + + def _loop(self): + while self._running: + try: + conn, _ = self._sock.accept() + except OSError: + return + threading.Thread( + target=self._handle, args=(conn,), daemon=True).start() + + def _handle(self, conn): + with conn: + # Increment before responding: the client cannot observe the + # HTTP status (and thus rotate / surface its error) until the + # response has been written, so a count read after the connect + # walk returns is guaranteed to have seen this probe. + with self._lock: + self.connections += 1 + try: + conn.recv(8192) + reason = {401: 'Unauthorized', + 421: 'Misdirected Request'}.get( + self.status_code, 'Status') + lines = [f'HTTP/1.1 {self.status_code} {reason}'] + if self.role_header: + lines.append(self.role_header) + lines.append('Content-Length: 0') + lines.append('Connection: close') + conn.sendall( + ('\r\n'.join(lines) + '\r\n\r\n').encode('ascii')) + except OSError: + pass + + def close(self): + self._running = False + try: + self._sock.close() + except OSError: + pass + + +class TestEgressFailoverRoleNegotiation(unittest.TestCase): + """Reader connect-time role/auth failover, ported from Java's + ``QwpQueryClientMultiHostFailoverTest``. 
The pool is opened eagerly by + ``Client.from_conf`` (``questdb_db_connect``), so the connect walk -- and + any role/auth error -- surfaces there rather than at ``query()`` (Java's + explicit ``connect()``). These run against in-process fakes only and need + no QuestDB instance.""" + + def _server(self, status_code, role_header=None): + srv = _FakeStatusServer(status_code, role_header) + self.addCleanup(srv.close) + srv.start() + return srv + + @staticmethod + def _conf(servers, **extra): + addr = ','.join(f'127.0.0.1:{s.port}' for s in servers) + conf = f'qwpws::addr={addr};' + for key, value in extra.items(): + conf += f'{key}={value};' + return conf + + def test_replica_then_401_fails_fast_with_auth(self): + """``[REPLICA(421), auth(401)]``: the 421 rotates past the replica, + the 401 on the second endpoint short-circuits the walk with an auth + error (not a generic socket/role error). Both endpoints are probed.""" + replica = self._server(421, 'X-QuestDB-Role: REPLICA') + auth = self._server(401) + conf = self._conf( + [replica, auth], + auth_timeout_ms=2000, failover='off', target='any') + with self.assertRaises(qi.IngressError) as cm: + qi.Client.from_conf(conf) + self.assertEqual(cm.exception.code, qi.IngressErrorCode.AuthError) + self.assertIn('401', str(cm.exception)) + self.assertGreaterEqual(replica.connections, 1) + self.assertGreaterEqual(auth.connections, 1) + + def test_all_replica_fails_with_role_mismatch(self): + """Every endpoint role-rejects: the surfaced error is a distinct + ``RoleMismatch`` (naming the unsuitable role), *not* ``AuthError`` + and *not* the generic ``SocketError`` used for "all unreachable". + This is the typed distinction Java draws with + ``QwpRoleMismatchException`` -- an operator can tell "no primary + elected yet" from "bad credentials" and from "everything is down". 
+ Both replicas are probed.""" + r1 = self._server(421, 'X-QuestDB-Role: REPLICA') + r2 = self._server(421, 'X-QuestDB-Role: REPLICA') + conf = self._conf( + [r1, r2], + auth_timeout_ms=2000, failover='off', target='any') + with self.assertRaises(qi.IngressError) as cm: + qi.Client.from_conf(conf) + self.assertEqual(cm.exception.code, qi.IngressErrorCode.RoleMismatch) + self.assertIn('REPLICA', str(cm.exception)) + self.assertGreaterEqual(r1.connections, 1) + self.assertGreaterEqual(r2.connections, 1) + + def test_connect_does_not_double_walk_on_first_failure(self): + """With ``failover=off`` the initial connect walks the address list + exactly once: each role-rejecting endpoint is probed a single time + before the walk fails terminally -- no re-walking the list.""" + r1 = self._server(421, 'X-QuestDB-Role: REPLICA') + r2 = self._server(421, 'X-QuestDB-Role: REPLICA') + r3 = self._server(421, 'X-QuestDB-Role: REPLICA') + conf = self._conf( + [r1, r2, r3], + auth_timeout_ms=2000, failover='off', target='any') + with self.assertRaises(qi.IngressError) as cm: + qi.Client.from_conf(conf) + self.assertEqual(cm.exception.code, qi.IngressErrorCode.RoleMismatch) + self.assertEqual(r1.connections, 1) + self.assertEqual(r2.connections, 1) + self.assertEqual(r3.connections, 1) + if __name__ == '__main__': unittest.main() diff --git a/test/test.py b/test/test.py index 18f4461a..a9dc358c 100755 --- a/test/test.py +++ b/test/test.py @@ -7,9 +7,11 @@ import datetime import timeit import time +import threading from enum import Enum import random import pathlib +import tempfile import numpy as np import patch_path @@ -25,11 +27,16 @@ from mock_server import (Server, HttpServer, SETTINGS_WITHOUT_PROTOCOL_VERSION, SETTINGS_WITH_PROTOCOL_VERSION_V1, SETTINGS_WITH_PROTOCOL_VERSION_V2, SETTINGS_WITH_PROTOCOL_VERSION_V1_V2_V3,SETTINGS_WITH_PROTOCOL_VERSION_V4) +from qwp_ws_ack_server import QwpAckServer import questdb.ingress as qi if os.environ.get('TEST_QUESTDB_INTEGRATION') == '1': - 
from system_test import TestWithDatabase + from system_test import ( + TestWithDatabase, + TestEgressWithDatabase, + TestEgressPool, + TestColumnIngressNarrowTypes) from fixture import _parse_version @@ -38,20 +45,47 @@ try: import pandas as pd import numpy - import pyarrow except ImportError: pd = None -if pd is not None: +try: + import pyarrow +except ImportError: + pyarrow = None + +from test_client_capsule_path import ( + TestCapsulePathPyArrow, + TestCapsulePathPolars, + TestSchemaOverrides, + TestPyArrowRecordBatchDirect, + TestSchemaOverridesPandas, + TestBenchFlushArrowBatch, + TestCapsulePathPolarsMissing, + TestWriterMixingInOneChunk, + TestPandasPlannerRouting, +) +from test_client_dataframe_fuzz import ( + TestClientDataframeFuzz, + TestClientDataframeSfaFuzz, + TestClientDataframeRoundTrip, +) +from test_client_polars_fuzz import ( + TestClientPolarsDataframeFuzz, + TestClientPolarsDataframeSfaFuzz, + TestClientPolarsDataframeRoundTrip, +) +from test_dataframe_leaks import TestCategoricalArrowLeak, TestPyobjColumnarLeak + +if pd is not None and pyarrow is not None: from test_dataframe import TestPandasProtocolVersionV1 from test_dataframe import TestPandasProtocolVersionV2 from test_dataframe import TestPandasProtocolVersionV3 -else: +elif pd is None: class TestNoPandas(unittest.TestCase): def test_no_pandas(self): buf = qi.Buffer(protocol_version=2) - exp = 'Missing.*`pandas.*pyarrow`.*readthedocs.*installation.html.' - with self.assertRaisesRegex(ImportError, exp): + exp = 'Missing.*`pandas`.*`numpy`.*readthedocs.*installation.html.' 
+ with self.assertRaisesRegex(qi.IngressError, exp): buf.dataframe(None, at=qi.ServerTimestamp) @@ -66,6 +100,552 @@ def test_valid_yaml(self): yaml.safe_load(f) +class TestQwpWebSocketApi(unittest.TestCase): + def test_protocol_enum(self): + self.assertEqual(qi.Protocol.parse('qwpws'), qi.Protocol.QwpWs) + self.assertEqual(qi.Protocol.parse('qwpwss'), qi.Protocol.QwpWss) + self.assertFalse(qi.Protocol.QwpWs.tls_enabled) + self.assertTrue(qi.Protocol.QwpWss.tls_enabled) + + def test_progress_enum(self): + self.assertEqual( + qi.QwpWsProgress.parse('background'), + qi.QwpWsProgress.Background) + self.assertEqual( + qi.QwpWsProgress.parse('manual'), + qi.QwpWsProgress.Manual) + + def test_ingress_error_can_carry_qwpws_diagnostic(self): + err = qi.IngressError( + qi.IngressErrorCode.SocketError, + 'sender halted', + ( + qi.QwpWsErrorCategory.ParseError.c_value, + qi.QwpWsErrorPolicy.Halt.c_value, + 2, + 'bad line', + 44, + 5, + 6, + )) + + diagnostic = err.qwp_ws_error + + self.assertEqual(diagnostic.category, qi.QwpWsErrorCategory.ParseError) + self.assertEqual(diagnostic.applied_policy, qi.QwpWsErrorPolicy.Halt) + self.assertEqual(diagnostic.status, 2) + self.assertEqual(diagnostic.message, 'bad line') + self.assertEqual(diagnostic.message_sequence, 44) + self.assertEqual(diagnostic.from_fsn, 5) + self.assertEqual(diagnostic.to_fsn, 6) + self.assertIs(err.qwp_ws_error, diagnostic) + + def test_server_rejection_error_is_specific_subclass(self): + err = qi.IngressServerRejectionError( + qi.IngressErrorCode.ServerRejection, + 'sender halted', + ( + qi.QwpWsErrorCategory.ParseError.c_value, + qi.QwpWsErrorPolicy.Halt.c_value, + 2, + 'bad line', + 44, + 5, + 6, + )) + + self.assertIsInstance(err, qi.IngressError) + self.assertEqual(err.code, qi.IngressErrorCode.ServerRejection) + self.assertEqual(err.qwp_ws_error.category, qi.QwpWsErrorCategory.ParseError) + + def test_python_only_error_codes_do_not_overlap_ffi_codes(self): + code_values = [code.value for code in 
qi.IngressErrorCode] + + self.assertEqual(len(code_values), len(set(code_values))) + # Iterating the enum skips aliases, so the check above passes even + # when two members share a value (the duplicate silently becomes an + # alias of the first). ``__members__`` includes aliases; an equal + # count proves no member collided onto another's value -- e.g. a new + # FFI code landing on a synthetic ``failover_retry + N`` sentinel. + self.assertEqual( + len(qi.IngressErrorCode.__members__), + len(list(qi.IngressErrorCode)), + 'IngressErrorCode has aliased members (value collision)') + # RoleMismatch mirrors its FFI code; the synthetic Python-only codes + # sit strictly above it. + self.assertGreater( + qi.IngressErrorCode.BadDataFrame.value, + qi.IngressErrorCode.RoleMismatch.value) + self.assertGreater( + qi.IngressErrorCode.BadDataFrame.value, + qi.IngressErrorCode.ArrowIngest.value) + self.assertGreater( + qi.IngressErrorCode.Cancelled.value, + qi.IngressErrorCode.ArrowIngest.value) + # FailoverRetry mirrors the FFI ingress code (17); the synthetic + # Python-only codes sit above it and stay distinct. 
+ self.assertGreater( + qi.IngressErrorCode.FailoverRetry.value, + qi.IngressErrorCode.ArrowIngest.value) + self.assertGreater( + qi.IngressErrorCode.FailoverWouldDuplicate.value, + qi.IngressErrorCode.FailoverRetry.value) + + def test_unsupported_dataframe_shape_error_carries_failures(self): + err = qi.UnsupportedDataFrameShapeError( + 'unsupported frame', + [{'column': 'active', 'reason': 'bool_requires_packing'}]) + + self.assertIsInstance(err, qi.IngressError) + self.assertEqual(err.code, qi.IngressErrorCode.BadDataFrame) + self.assertEqual( + err.column_failures, + ({'column': 'active', 'reason': 'bool_requires_packing'},)) + + def test_client_from_conf_rejects_non_qwp_websocket(self): + with self.assertRaisesRegex( + qi.IngressError, + 'requires a QWP/WebSocket configuration string'): + qi.Client.from_conf('tcp::addr=localhost:9009;') + + def test_client_from_conf_requires_addr(self): + with self.assertRaisesRegex( + qi.IngressError, + 'Missing "addr" parameter'): + qi.Client.from_conf('qwpws::pool_size=1;') + + def test_client_close_is_idempotent(self): + client = qi.Client.__new__(qi.Client) + client.close() + client.close() + + def test_closed_client_methods_reject(self): + client = qi.Client.__new__(qi.Client) + + with self.assertRaisesRegex( + qi.IngressError, + "__enter__\\(\\) can't be called: Client is closed"): + with client: + pass + with self.assertRaisesRegex( + qi.IngressError, + "reap_idle\\(\\) can't be called: Client is closed"): + client.reap_idle() + with self.assertRaisesRegex( + qi.IngressError, + "dataframe\\(\\) can't be called: Client is closed"): + client.dataframe([], table_name='tbl', at=qi.ServerTimestamp) + + @unittest.skipIf(pd is None, 'pandas not installed') + @unittest.skipIf(pyarrow is None, 'pyarrow not installed') + def test_client_dataframe_uses_pooled_qwp_websocket_connection(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], 
dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2], dtype='int64'), + 'price': pd.Series([10.5, 11.5], dtype='float64'), + }) + + with QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + client = qi.Client.from_conf(conf) + try: + for _ in range(3): + client.dataframe(df, table_name='trades', at='ts') + finally: + client.close() + + stats = server.snapshot() + + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 3) + self.assertEqual(stats['binary_frames'], stats['qwp1_frames']) + self.assertGreater(stats['binary_bytes'], 0) + + @unittest.skipIf(pd is None, 'pandas not installed') + def test_client_dataframe_rejects_timestamp_only_before_publication(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], + dtype='datetime64[ns]'), + }) + + with QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + client = qi.Client.from_conf(conf) + try: + with self.assertRaises(qi.UnsupportedDataFrameShapeError) as cm: + client.dataframe(df, table_name='trades', at='ts') + finally: + client.close() + + stats = server.snapshot() + + self.assertEqual( + cm.exception.column_failures, + ({'column': None, + 'target': None, + 'source_code': None, + 'reason': 'v1 requires at least one non-timestamp data column.'},)) + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['binary_frames'], 0) + self.assertEqual(stats['qwp1_frames'], 0) + + @unittest.skipIf(pyarrow is None, 'pyarrow not installed') + def test_client_dataframe_capsule_proactive_sync(self): + n = 140 + table = pyarrow.table({ + 'v': pyarrow.array(list(range(n)), type=pyarrow.int64()), + 'ts': pyarrow.array( + [i * 1_000_000 for i in range(n)], + type=pyarrow.timestamp('us')), + }) + + with 
QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + client = qi.Client.from_conf(conf) + try: + client.dataframe( + table, table_name='trades', at='ts', + max_rows_per_batch=1) + finally: + client.close() + stats = qi._debug_dataframe_columnar_io_stats() + finally: + qi._debug_dataframe_columnar_io_stats(enabled=False, reset=True) + + snap = server.snapshot() + + self.assertEqual(snap['errors'], []) + self.assertGreaterEqual(stats['flush_calls'], n) + self.assertGreaterEqual(stats['sync_calls'], 2) + self.assertGreaterEqual(snap['binary_frames'], n) + + @unittest.skipIf(pd is None, 'pandas not installed') + def test_client_close_waits_for_active_dataframe(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], + dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2], dtype='int64'), + }) + + with QwpAckServer(ack_delay_s=0.2) as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + client = qi.Client.from_conf(conf) + errors = [] + + def ingest(): + try: + client.dataframe(df, table_name='trades', at='ts') + except Exception as exc: + errors.append(exc) + + thread = threading.Thread(target=ingest) + thread.start() + deadline = time.monotonic() + 2 + while (server.snapshot()['binary_frames'] == 0 and + time.monotonic() < deadline): + time.sleep(0.01) + self.assertGreater(server.snapshot()['binary_frames'], 0) + self.assertTrue(thread.is_alive()) + + close_started = time.monotonic() + client.close() + close_elapsed = time.monotonic() - close_started + thread.join(timeout=1) + + self.assertFalse(thread.is_alive()) + self.assertEqual(errors, []) + self.assertGreater(close_elapsed, 0.05) + with self.assertRaisesRegex( + qi.IngressError, + "reap_idle\\(\\) can't be called: Client is 
closed"): + client.reap_idle() + + @unittest.skipIf(pd is None or pyarrow is None, + 'pandas + pyarrow not installed') + def test_client_dataframe_syncs_before_returning_after_late_flush_error(self): + labels = ['a'] * 64000 + labels.append('x' * 1_200_000) + df = pd.DataFrame({ + 'ts': pd.date_range( + '2024-01-01 00:00:00', + periods=len(labels), + freq='ns'), + 'label': pd.Series( + pyarrow.array(labels, type=pyarrow.string()), + dtype='string[pyarrow]'), + }) + + with QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;' + 'max_buf_size=1000000;') + client = qi.Client.from_conf(conf) + try: + with self.assertRaisesRegex( + qi.IngressError, + 'exceeds max_buf_size'): + client.dataframe(df, table_name='trades', at='ts') + finally: + client.close() + + stats = server.snapshot() + + self.assertEqual(stats['errors'], []) + # Three data chunks fit before the oversized final chunk fails + # locally. Error cleanup then emits one sync frame so those deferred + # chunks are committed before dataframe() returns. 
+ self.assertEqual(stats['qwp1_frames'], 4) + + @unittest.skipIf(pd is None, 'pandas not installed') + @unittest.skipIf(pyarrow is None, 'pyarrow not installed') + def test_real_benchmark_paths_use_qwp_websocket_ack_flow(self): + from benchmark_pandas_columnar import ( + make_numeric_core, + run_real_client_path, + run_real_row_path) + + df = make_numeric_core(2) + + with QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + _samples, _cpu_samples, last = run_real_client_path( + df, + 2, + 1, + 1, + conf=conf, + table_name='trades') + client_stats = server.snapshot() + + self.assertEqual(last['path'], 'real-client') + self.assertNotIn('manual_chunk_plan', last) + self.assertNotIn('manual_chunk_plan_error', last) + self.assertEqual(last['rows_ingested'], 2) + self.assertFalse(last['columnar_io_stats']['enabled']) + self.assertGreaterEqual(last['columnar_io_stats']['flush_calls'], 1) + self.assertEqual(last['columnar_io_stats']['sync_calls'], 1) + self.assertGreaterEqual(last['columnar_io_stats']['flush_s'], 0.0) + self.assertGreaterEqual(last['columnar_io_stats']['sync_s'], 0.0) + self.assertEqual(client_stats['errors'], []) + self.assertEqual(client_stats['accepted_connections'], 1) + self.assertGreaterEqual(client_stats['qwp1_frames'], 2) + + with QwpAckServer() as server: + conf = ( + f'qwpws::addr=127.0.0.1:{server.port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + _samples, _cpu_samples, last = run_real_row_path( + df, + 2, + 1, + 1, + conf=conf, + table_name='trades', + await_ack_ms=5000) + row_stats = server.snapshot() + + self.assertEqual(last['path'], 'real-row') + self.assertTrue(last['acked']) + self.assertEqual(last['rows_ingested'], 2) + self.assertNotIn('pool_size', last['conf']) + self.assertNotIn('pool_max', last['conf']) + self.assertNotIn('pool_reap', last['conf']) + self.assertEqual(row_stats['errors'], []) + 
self.assertEqual(row_stats['accepted_connections'], 1) + self.assertGreaterEqual(row_stats['qwp1_frames'], 1) + + @unittest.skipIf(pd is None, 'pandas not installed') + def test_benchmark_schema_sql_report_uses_schema_table(self): + from benchmark_pandas_columnar import schema_sql_report + + report = schema_sql_report('numeric-core') + + self.assertEqual(report['schema'], 'numeric-core') + self.assertEqual(report['table_name'], 'bench_numeric_core') + self.assertEqual( + report['drop_sql'], + 'DROP TABLE IF EXISTS bench_numeric_core') + self.assertIn( + 'CREATE TABLE bench_numeric_core', + report['create_sql']) + self.assertIn('seq LONG', report['create_sql']) + self.assertIn('ts TIMESTAMP', report['create_sql']) + self.assertEqual( + report['truncate_sql'], + 'TRUNCATE TABLE bench_numeric_core') + + def test_from_conf_preserves_qwpws_progress(self): + sender = qi.Sender.from_conf( + 'qwpws::addr=localhost:9000;qwp_ws_progress=manual;') + try: + with self.assertRaisesRegex( + qi.IngressError, + r'drive_once\(\) can\'t be called: Sender is closed'): + sender.drive_once() + finally: + sender.close(False) + + def test_from_conf_preserves_c_only_qwpws_keys(self): + with self.assertRaisesRegex( + qi.IngressError, + 'invalid sf_max_bytes'): + qi.Sender.from_conf('qwpws::addr=localhost:9000;sf_max_bytes=64mi;') + + def test_from_conf_accepts_qwpwss_tls_roots_password(self): + with tempfile.NamedTemporaryFile() as roots: + sender = qi.Sender.from_conf( + 'qwpwss::addr=localhost:9000;' + f'tls_roots={roots.name};' + 'tls_roots_password=secret;') + try: + self.assertIsInstance(sender, qi.Sender) + finally: + sender.close(False) + + def test_tls_roots_password_rejects_non_qwp_websocket(self): + with tempfile.NamedTemporaryFile() as roots: + with self.assertRaisesRegex( + qi.IngressError, + 'only supported for QWP/WebSocket'): + qi.Sender.from_conf( + 'tcps::addr=localhost:9009;' + f'tls_roots={roots.name};' + 'tls_roots_password=secret;') + + def 
test_from_conf_preserves_http_retry_max_backoff(self): + with self.assertRaisesRegex( + qi.IngressError, + 'retry_max_backoff_millis.*at least 10'): + qi.Sender.from_conf( + 'http::addr=localhost:9000;retry_max_backoff_millis=3;') + + def test_retry_max_backoff_rejects_non_http_protocol(self): + with self.assertRaisesRegex( + qi.IngressError, + 'retry_max_backoff_millis is supported only in ILP over HTTP'): + qi.Sender( + qi.Protocol.Tcp, + '127.0.0.1', + 9009, + retry_max_backoff=250) + + def test_duration_options_reject_bool(self): + cases = { + 'auth_timeout': '"auth_timeout" must be an int or a timedelta', + 'retry_timeout': '"retry_timeout" must be an int or a timedelta', + 'retry_max_backoff': ( + '"retry_max_backoff" must be an int or a timedelta'), + 'request_timeout': ( + '"request_timeout" must be an int or a timedelta'), + } + for option, message in cases.items(): + for value in (False, True): + with self.subTest(option=option, value=value): + with self.assertRaisesRegex(TypeError, message): + qi.Sender( + qi.Protocol.Http, + '127.0.0.1', + 9000, + **{option: value}) + + def test_from_conf_preserves_escaped_semicolon_in_c_only_qwpws_key(self): + sender = qi.Sender.from_conf( + 'qwpws::addr=localhost:9000;sf_dir=/tmp/qdb;;sf;') + try: + self.assertIsInstance(sender, qi.Sender) + finally: + sender.close(False) + + def test_qwpws_progress_rejects_non_websocket_protocol(self): + with self.assertRaisesRegex( + qi.IngressError, + 'only supported for QWP/WebSocket'): + qi.Sender( + qi.Protocol.QwpUdp, + '127.0.0.1', + 9009, + qwp_ws_progress=qi.QwpWsProgress.Manual) + + def test_qwpws_error_handler_can_be_registered(self): + sender = qi.Sender( + qi.Protocol.QwpWs, + '127.0.0.1', + 9000, + qwp_ws_error_handler=lambda error: None) + try: + self.assertIsInstance(sender, qi.Sender) + finally: + sender.close(False) + + def test_qwpws_error_handler_rejects_non_websocket_protocol(self): + with self.assertRaisesRegex( + qi.IngressError, + 'only supported for 
QWP/WebSocket'): + qi.Sender( + qi.Protocol.QwpUdp, + '127.0.0.1', + 9009, + qwp_ws_error_handler=lambda error: None) + + def test_qwpws_fsn_helpers_reject_non_websocket_sender_even_when_empty(self): + sender = qi.Sender( + qi.Protocol.QwpUdp, + '127.0.0.1', + 9009) + try: + sender.establish() + with self.assertRaisesRegex( + qi.IngressError, + 'only supported for QWP/WebSocket'): + sender.flush_and_get_fsn() + with self.assertRaisesRegex( + qi.IngressError, + 'only supported for QWP/WebSocket'): + sender.flush_and_keep_and_get_fsn() + finally: + sender.close(False) + + def test_qwpws_progress_conf_override_conflict(self): + with self.assertRaisesRegex(ValueError, '"qwp_ws_progress" is already present'): + qi.Sender.from_conf( + 'qwpws::addr=localhost:9000;qwp_ws_progress=manual;', + qwp_ws_progress=qi.QwpWsProgress.Background) + + class TestBases: """ Dummy class that's only used so that we can create subclasses of testcases. @@ -772,8 +1352,7 @@ def test_transaction_over_tcp(self): server.accept() self.assertRaisesRegex( qi.IngressError, - ('Transactions aren\'t supported for ILP/TCP,' + - ' use ILP/HTTP instead.'), + 'Transactions are only supported for ILP/HTTP.', sender.transaction, 'table_name') def test_transaction_basic(self): @@ -1396,8 +1975,10 @@ def encode_int_or_off(v): 'tls_verify': lambda v: 'on' if v else 'unsafe_off', 'tls_ca': str, 'tls_roots': str, + 'tls_roots_password': str, 'max_buf_size': str, 'retry_timeout': encode_duration, + 'retry_max_backoff': encode_duration, 'request_min_throughput': str, 'request_timeout': encode_duration, 'auto_flush': lambda v: 'on' if v else 'off', @@ -1413,8 +1994,13 @@ def encode(k, v): encoder = encoders.get(k, str) return encoder(v) + def conf_key(k): + if k == 'retry_max_backoff': + return 'retry_max_backoff_millis' + return k + return f'{protocol.tag}::addr={host}:{port};' + ''.join( - f'{k}={encode(k, v)};' + f'{conf_key(k)}={encode(k, v)};' for k, v in kwargs.items() if v is not None) @@ -1484,6 +2070,79 
@@ class TestBufferProtocolVersionV3(TestBases.TestBuffer): version = 3 +class TestUninitializedBuffer(unittest.TestCase): + """Verify that Buffer.__new__(Buffer) raises instead of segfaulting.""" + + def _make_uninit(self): + return qi.Buffer.__new__(qi.Buffer) + + def test_len(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + len(self._make_uninit()) + + def test_bytes(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + bytes(self._make_uninit()) + + def test_capacity(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + self._make_uninit().capacity() + + def test_clear(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + self._make_uninit().clear() + + def test_reserve(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + self._make_uninit().reserve(1) + + def test_row(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + self._make_uninit().row('t', columns={'x': 1}, at=qi.ServerTimestamp) + + @unittest.skipIf(not pd, 'pandas not installed') + def test_dataframe(self): + import pandas as _pd + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + self._make_uninit().dataframe( + _pd.DataFrame({'x': [1]}), + table_name='t', + at=qi.ServerTimestamp) + + def test_flush(self): + with self.assertRaisesRegex(qi.IngressError, 'Buffer is not initialized'): + with Server() as server, \ + qi.Sender(qi.Protocol.Tcp, '127.0.0.1', server.port) as sender: + server.accept() + sender.flush(self._make_uninit()) + + +class TestBufferFactory(unittest.TestCase): + def test_direct_construction_deprecated(self): + with self.assertWarns(DeprecationWarning): + buf = qi.Buffer(2) + self.assertEqual(len(buf), 0) + + def test_ilp_factory(self): + buf = qi.Buffer.ilp() + self.assertEqual(len(buf), 0) + buf.row('tbl', columns={'x': 1}, at=qi.ServerTimestamp) + 
self.assertGreater(len(bytes(buf)), 0) + + def test_ilp_invalid_version(self): + for bad in (0, 4): + with self.assertRaises(qi.IngressError) as cm: + qi.Buffer.ilp(bad) + self.assertEqual( + cm.exception.code, qi.IngressErrorCode.ProtocolVersionError) + + def test_qwp_factory(self): + buf = qi.Buffer.qwp() + self.assertIsInstance(buf, qi.Buffer) + self.assertEqual(len(buf), 0) + self.assertGreater(buf.capacity(), 0) + + if __name__ == '__main__': if os.environ.get('TEST_QUESTDB_PROFILE') == '1': import cProfile diff --git a/test/test_client_capsule_path.py b/test/test_client_capsule_path.py new file mode 100644 index 00000000..fd31aa87 --- /dev/null +++ b/test/test_client_capsule_path.py @@ -0,0 +1,601 @@ +#!/usr/bin/env python3 +"""Smoke tests for the Arrow PyCapsule Interface dispatch path +(`__arrow_c_stream__`) used by polars / pyarrow / generic Arrow-native +DataFrame inputs to `Client.dataframe()`. +""" + +import sys +sys.dont_write_bytecode = True +import datetime +import os +import unittest + +import patch_path + +PROJ_ROOT = patch_path.PROJ_ROOT +sys.path.append(str(PROJ_ROOT / 'c-questdb-client' / 'system_test')) + +import questdb.ingress as qi +from qwp_ws_ack_server import QwpAckServer + +try: + import polars as pl +except ImportError: + pl = None + +try: + import pyarrow as pa +except ImportError: + pa = None + + +def _client_conf(port): + return ( + f'qwpws::addr=127.0.0.1:{port};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;') + + +def _ts_us(year, month, day, hour=0, minute=0, second=0): + return int(datetime.datetime( + year, month, day, hour, minute, second, + tzinfo=datetime.timezone.utc).timestamp() * 1_000_000) + + +class TestCapsulePathPyArrow(unittest.TestCase): + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_pyarrow_table_designated_ts_column(self): + schema = pa.schema([ + pa.field('symbol', pa.string()), + pa.field('price', pa.float64()), + pa.field('ts', pa.timestamp('us')), + ]) + table = 
pa.Table.from_pydict({ + 'symbol': ['ETH-USD', 'BTC-USD', 'ETH-USD'], + 'price': [2615.54, 67234.12, 2620.88], + 'ts': [_ts_us(2025, 1, 1, 12, 0, 0), + _ts_us(2025, 1, 1, 12, 0, 1), + _ts_us(2025, 1, 1, 12, 0, 2)], + }, schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(table, table_name='trades', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_pyarrow_record_batch_via_table_from_batches(self): + schema = pa.schema([ + pa.field('seq', pa.int64()), + pa.field('ts', pa.timestamp('us')), + ]) + batch = pa.RecordBatch.from_pydict({ + 'seq': [1, 2], + 'ts': [_ts_us(2025, 1, 1), _ts_us(2025, 1, 2)], + }, schema=schema) + table = pa.Table.from_batches([batch]) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(table, table_name='seq_log', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_pyarrow_max_rows_per_batch_splits(self): + n = 64 + schema = pa.schema([ + pa.field('v', pa.int64()), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 'v': list(range(n)), + 'ts': [_ts_us(2025, 1, 1) + i for i in range(n)], + }, schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + table, + table_name='split', + at='ts', + max_rows_per_batch=16) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['qwp1_frames'], 4) + + +class TestCapsulePathPolars(unittest.TestCase): + + @unittest.skipIf(pl is None, 'polars not 
installed') + def test_polars_dataframe_designated_ts(self): + df = pl.DataFrame({ + 'symbol': ['ETH-USD', 'BTC-USD'], + 'price': [2615.54, 67234.12], + 'ts': [ + datetime.datetime(2025, 1, 1, 12, 0, 0), + datetime.datetime(2025, 1, 1, 12, 0, 1), + ], + }, schema={ + 'symbol': pl.Utf8, + 'price': pl.Float64, + 'ts': pl.Datetime('us'), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(df, table_name='trades', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + + @unittest.skipIf(pl is None, 'polars not installed') + def test_polars_lazyframe_is_collected(self): + lf = pl.LazyFrame({ + 'v': [1, 2, 3], + 'ts': [ + datetime.datetime(2025, 1, 1), + datetime.datetime(2025, 1, 2), + datetime.datetime(2025, 1, 3), + ], + }, schema={'v': pl.Int64, 'ts': pl.Datetime('us')}) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(lf, table_name='lazy_t', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + + +class TestSchemaOverrides(unittest.TestCase): + + @unittest.skipIf(pl is None, 'polars not installed') + def test_polars_schema_overrides_ipv4_no_pyarrow(self): + if qi._debug_dataframe_pyarrow_loaded(): + self.skipTest( + 'an earlier test already lazy-loaded pyarrow inside this ' + 'process; the no-pyarrow assertion only holds when this ' + 'test runs first.') + df = pl.DataFrame({ + 'addr': pl.Series('addr', [0x0A000001, 0xC0A80101], + dtype=pl.UInt32), + 'ts': pl.Series('ts', [ + datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc), + datetime.datetime(2025, 1, 2, tzinfo=datetime.timezone.utc), + ], dtype=pl.Datetime('us', time_zone='UTC')), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + df, + table_name='polars_ipv4', + at='ts', + 
schema_overrides={'addr': 'ipv4'}) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertFalse( + qi._debug_dataframe_pyarrow_loaded(), + 'polars + schema_overrides path must not lazy-import pyarrow') + + @unittest.skipIf(pl is None, 'polars not installed') + def test_polars_schema_overrides_symbol_no_pyarrow(self): + if qi._debug_dataframe_pyarrow_loaded(): + self.skipTest( + 'an earlier test already lazy-loaded pyarrow inside this ' + 'process; the no-pyarrow assertion only holds when this ' + 'test runs first.') + df = pl.DataFrame({ + 'region': pl.Series('region', ['us-east', 'us-west', 'us-east']), + 'price': pl.Series('price', [1.0, 2.0, 3.0], dtype=pl.Float64), + 'ts': pl.Series('ts', [ + datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc), + datetime.datetime(2025, 1, 2, tzinfo=datetime.timezone.utc), + datetime.datetime(2025, 1, 3, tzinfo=datetime.timezone.utc), + ], dtype=pl.Datetime('us', time_zone='UTC')), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + df, + table_name='polars_symbol', + at='ts', + schema_overrides={'region': 'symbol'}) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertFalse( + qi._debug_dataframe_pyarrow_loaded(), + 'polars + schema_overrides path must not lazy-import pyarrow') + + def test_schema_overrides_rejects_decimal_kind(self): + with self.assertRaisesRegex(ValueError, 'kind'): + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + object(), + table_name='t', + at='ts', + schema_overrides={'x': ('decimal', 2)}) + finally: + client.close() + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_schema_overrides_ipv4(self): + schema = pa.schema([ + pa.field('addr', pa.uint32()), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 
'addr': [0x0A000001, 0xC0A80101], + 'ts': [_ts_us(2025, 1, 1), _ts_us(2025, 1, 2)], + }, schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + table, + table_name='ipv4_log', + at='ts', + schema_overrides={'addr': 'ipv4'}) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_schema_overrides_rejects_unknown_kind(self): + schema = pa.schema([ + pa.field('x', pa.int64()), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 'x': [1, 2], 'ts': [_ts_us(2025, 1, 1), _ts_us(2025, 1, 2)], + }, schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + with self.assertRaisesRegex(ValueError, 'kind'): + client.dataframe( + table, + table_name='t', + at='ts', + schema_overrides={'x': 'bogus'}) + finally: + client.close() + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_schema_overrides_rejects_bad_geohash_bits(self): + schema = pa.schema([ + pa.field('loc', pa.int32()), + pa.field('ts', pa.timestamp('us')), + ]) + table = pa.Table.from_pydict({ + 'loc': [1, 2], 'ts': [_ts_us(2025, 1, 1), _ts_us(2025, 1, 2)], + }, schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + with self.assertRaisesRegex(ValueError, 'geohash bits'): + client.dataframe( + table, + table_name='t', + at='ts', + schema_overrides={'loc': ('geohash', 0)}) + finally: + client.close() + + +class TestPyArrowRecordBatchDirect(unittest.TestCase): + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_bare_record_batch_routes_via_table_wrap(self): + schema = pa.schema([ + pa.field('v', pa.int64()), + pa.field('ts', pa.timestamp('us')), + ]) + batch = pa.RecordBatch.from_pydict({ + 'v': [1, 2, 3], + 'ts': [_ts_us(2025, 1, 1) + i for i in range(3)], + }, 
schema=schema) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(batch, table_name='from_rb', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + +class TestSchemaOverridesPandas(unittest.TestCase): + + @unittest.skipIf(pa is None or pl is None, 'pyarrow + polars required') + def test_pandas_dataframe_with_schema_overrides_ipv4(self): + import pandas as pd + df = pd.DataFrame({ + 'addr': pd.Series([0x0A000001, 0xC0A80101], dtype='uint32'), + 'ts': pd.Series( + pa.array([_ts_us(2025, 1, 1), _ts_us(2025, 1, 2)], + type=pa.timestamp('us')), + dtype=pd.ArrowDtype(pa.timestamp('us'))), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + df, + table_name='ipv4_pandas', + at='ts', + schema_overrides={'addr': 'ipv4'}) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + + +class TestBenchFlushArrowBatch(unittest.TestCase): + """Regression coverage equivalent to the old + `_bench_dataframe_append_arrow_buffer` tests, migrated to the new + `column_sender_flush_arrow_batch` path.""" + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_uses_rust_classifier_accepts_uint_and_f16(self): + import pandas as pd + import numpy as np + ts_type = pa.timestamp('ms', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array( + [1704067200000, 1704067201000, 1704067202000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'u8': pd.Series( + pa.array([1, 2, None], type=pa.uint8()), + dtype=pd.ArrowDtype(pa.uint8())), + 'u16': pd.Series( + pa.array([1000, None, 3000], type=pa.uint16()), + dtype=pd.ArrowDtype(pa.uint16())), + 'u64': pd.Series( + pa.array([1, 2 ** 63 - 1, None], type=pa.uint64()), + dtype=pd.ArrowDtype(pa.uint64())), + 'f16': pd.Series( + pa.array(np.array([1.5, 
2.5, 3.5], dtype=np.float16), + type=pa.float16()), + dtype=pd.ArrowDtype(pa.float16())), + }) + batch = pa.RecordBatch.from_pandas(df, preserve_index=False) + with QwpAckServer() as server: + result = qi._bench_dataframe_flush_arrow_batch( + batch, + table_name='trades', + at='ts', + conf=_client_conf(server.port), + iterations=2) + self.assertEqual(result['iterations'], 2) + self.assertEqual(result['row_count'], 3) + self.assertEqual(result['col_count'], 5) + self.assertEqual(result['completed'], 2) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_rejects_uint64_above_i64_max(self): + import pandas as pd + ts_type = pa.timestamp('ms', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array([1704067200000, 1704067201000], type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'u64': pd.Series( + pa.array([1, 2 ** 63], type=pa.uint64()), + dtype=pd.ArrowDtype(pa.uint64())), + }) + batch = pa.RecordBatch.from_pandas(df, preserve_index=False) + with QwpAckServer() as server: + with self.assertRaisesRegex( + qi.IngressError, + r'UInt64 value 9223372036854775808 .* does not fit QuestDB LONG'): + qi._bench_dataframe_flush_arrow_batch( + batch, + table_name='trades', + at='ts', + conf=_client_conf(server.port), + iterations=1) + + +class TestCapsulePathPolarsMissing(unittest.TestCase): + + def test_non_polars_non_arrow_falls_through(self): + """A bare object without `__arrow_c_stream__` and not polars / not + pandas falls through the capsule + pyarrow paths and raises. 
+ """ + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + with self.assertRaises((TypeError, qi.IngressError)): + client.dataframe(object(), table_name='t', at=None) + finally: + client.close() + + +class TestWriterMixingInOneChunk(unittest.TestCase): + """Plan Q3: confirm that a pyobj-sniffed column (string), an + Arrow-backed narrow integer (i8_arrow), and a numpy-direct column + (int64) can coexist in one pandas DataFrame, share a single + `column_sender_chunk`, and produce a valid wire frame.""" + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_pyobj_str_arrow_i8_numpy_i64_mix(self): + import pandas as pd + import numpy as np + df = pd.DataFrame({ + 'name': pd.Series(['alpha', 'beta', None], dtype='object'), + 'rank': pd.Series( + pa.array([1, -1, 7], type=pa.int8()), + dtype=pd.ArrowDtype(pa.int8())), + 'qty': pd.Series([100, 200, 300], dtype='int64'), + 'ts': pd.Series([ + pd.Timestamp('2025-01-01 00:00:00'), + pd.Timestamp('2025-01-01 00:00:01'), + pd.Timestamp('2025-01-01 00:00:02')], + dtype='datetime64[ns]'), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(df, table_name='mixed', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + +class TestPandasPlannerRouting(unittest.TestCase): + """Pandas object columns use the manual dataframe planner; Arrow-backed + pandas columns can use the capsule route.""" + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_arrow_backed_pandas_uses_capsule_without_overrides(self): + import numpy as np + import pandas as pd + ts_type = pa.timestamp('ms', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array( + [1704067200000, 1704067201000, 1704067202000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'u64': pd.Series( + pa.array([1, 2 ** 63 - 1,
None], type=pa.uint64()), + dtype=pd.ArrowDtype(pa.uint64())), + 'f16': pd.Series( + pa.array(np.array([1.5, 2.5, 3.5], dtype=np.float16), + type=pa.float16()), + dtype=pd.ArrowDtype(pa.float16())), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe(df, table_name='arrow_pandas', at='ts') + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_arrow_backed_pandas_symbol_override_uses_capsule(self): + import pandas as pd + ts_type = pa.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array([1704067200000000, 1704067201000000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'region': pd.array( + pa.array(['us-east', 'us-west'], type=pa.string()), + dtype=pd.ArrowDtype(pa.string())), + 'v': pd.Series( + pa.array([1, 2], type=pa.int64()), + dtype=pd.ArrowDtype(pa.int64())), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + df, table_name='arrow_pandas_symbols', + at='ts', symbols=['region']) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_mixed_arrow_numpy_symbol_override_uses_manual(self): + # A numpy column routes the whole frame to the manual planner; the + # Arrow string column overridden to SYMBOL is ingested via the + # arrow-import symbol path (force_symbol). 
+ import pandas as pd + ts_type = pa.timestamp('us', tz='UTC') + df = pd.DataFrame({ + 'ts': pd.Series( + pa.array([1704067200000000, 1704067201000000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'region': pd.array( + pa.array(['us-east', 'us-west'], type=pa.string()), + dtype=pd.ArrowDtype(pa.string())), + 'v': pd.Series([1, 2], dtype='int64'), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + client.dataframe( + df, table_name='mixed_arrow_symbols', + at='ts', symbols=['region']) + finally: + client.close() + stats = server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + @unittest.skipIf(pa is None, 'pyarrow not installed') + def test_pyobj_str_bad_cell_fails_before_borrowing_conn(self): + import pandas as pd + df_bad = pd.DataFrame({ + 'name': pd.Series(['alpha', 12345, None], dtype='object'), + 'ts': pd.Series([ + pd.Timestamp('2025-01-01 00:00:00'), + pd.Timestamp('2025-01-01 00:00:01'), + pd.Timestamp('2025-01-01 00:00:02')], + dtype='datetime64[ns]'), + }) + df_good = pd.DataFrame({ + 'name': pd.Series(['x', 'y'], dtype='object'), + 'ts': pd.Series([ + pd.Timestamp('2025-01-02 00:00:00'), + pd.Timestamp('2025-01-02 00:00:01')], + dtype='datetime64[ns]'), + }) + with QwpAckServer() as server: + client = qi.Client.from_conf(_client_conf(server.port)) + try: + with self.assertRaises(qi.IngressError): + client.dataframe(df_bad, table_name='t', at='ts') + client.dataframe(df_good, table_name='t', at='ts') + finally: + client.close() + stats = server.snapshot() + # The bad pandas frame fails during manual-plan validation before a + # connection is borrowed. The good frame is the only publish. 
+ self.assertEqual(stats['errors'], []) + self.assertEqual(stats['accepted_connections'], 1) + self.assertGreaterEqual(stats['qwp1_frames'], 1) + + +if __name__ == '__main__': + unittest.main() diff --git a/test/test_client_dataframe_fuzz.py b/test/test_client_dataframe_fuzz.py new file mode 100644 index 00000000..1ae5a9a0 --- /dev/null +++ b/test/test_client_dataframe_fuzz.py @@ -0,0 +1,1385 @@ +""" +Deterministic, seed-controlled fuzz coverage for Client.dataframe(). + +Mirrors the seed-and-replay convention from +``c-questdb-client/system_test/qwp_ws_fuzz.py``: + + - Pick a master 64-bit seed at ``setUpClass``: random by default via + ``secrets.randbits(64)``, or explicit via ``QDB_CLIENT_FUZZ_SEED`` + (hex with ``0x`` prefix, or decimal). + - Print the master seed to stderr once per run so failing CI logs are + reproducible. + - Each iteration draws its own child seed from the master, so a single + failing iteration can be reproduced by setting + ``QDB_CLIENT_FUZZ_ITER_SEED=`` to run that one iteration alone. + +Every iteration drives ``Client.dataframe()`` round-trip through a local +``QwpAckServer`` fixture (no real QuestDB required) and asserts: + + - Frames the v1 planner rejects raise before any QWP/WebSocket binary frame + is published. Most shape rejections raise + ``UnsupportedDataFrameShapeError``; Arrow validation rejections surface as + ``IngressError``. + - Frames the v1 planner accepts complete without raising and produce at + least one QWP1 binary frame at the server (unless the frame is empty, + in which case ``Client.dataframe()`` is a no-op). + - The pool reuses a single TCP accept across the whole iteration loop + (``pool_size=pool_max=1``, ``pool_reap=manual``). + - The server reports no protocol-level errors at any point. 
+ +Usage:: + + venv/bin/python -m unittest test.test_client_dataframe_fuzz + + # Reproduce a master sequence: + QDB_CLIENT_FUZZ_SEED=0xdeadbeefdeadbeef \\ + venv/bin/python -m unittest test.test_client_dataframe_fuzz + + # Run a longer sweep: + QDB_CLIENT_FUZZ_ITERS=500 \\ + venv/bin/python -m unittest test.test_client_dataframe_fuzz + + # Reproduce a single iteration: + QDB_CLIENT_FUZZ_ITER_SEED=0x... \\ + venv/bin/python -m unittest \\ + test.test_client_dataframe_fuzz.TestClientDataframeFuzz.test_fuzz_round_trip +""" + +import sys + +sys.dont_write_bytecode = True + +import datetime +import os +import pathlib +import random +import secrets +import tempfile +import unittest +import uuid + +import numpy as np + +import patch_path +patch_path.patch() + +import questdb.ingress as qi + +PROJ_ROOT = patch_path.PROJ_ROOT +sys.path.append(str(PROJ_ROOT / 'c-questdb-client' / 'system_test')) + +try: + import pandas as pd + import pyarrow as pa +except ImportError: + pd = None + pa = None + +from qwp_ws_ack_server import QwpAckServer + + +SEED_ENV = 'QDB_CLIENT_FUZZ_SEED' +ITER_SEED_ENV = 'QDB_CLIENT_FUZZ_ITER_SEED' +ITERS_ENV = 'QDB_CLIENT_FUZZ_ITERS' + + +def _sfa_conf(port, sender_id, sf_dir): + return ( + f'qwpws::addr=127.0.0.1:{port};' + f'sender_id={sender_id};' + f'sf_dir={sf_dir};' + 'pool_size=1;' + 'pool_max=1;' + 'pool_reap=manual;' + 'reconnect_max_duration_millis=30000;' + 'close_flush_timeout_millis=30000;') + + +def _sfa_file_count(sf_dir, sender_id): + slot_dir = pathlib.Path(sf_dir) / sender_id + if not slot_dir.exists(): + return 0 + return sum(1 for path in slot_dir.iterdir() + if path.name.endswith('.sfa')) + + +def _parse_int_env(name): + raw = os.environ.get(name) + if raw is None or not raw.strip(): + return None + raw = raw.strip() + if raw.lower().startswith('0x'): + return int(raw, 16) + return int(raw) + + +def _derive_master_seed(): + parsed = _parse_int_env(SEED_ENV) + if parsed is not None: + return parsed + return secrets.randbits(64) + + 
+def _format_seed(seed): + return f'0x{seed:016x}' + + +class Rng: + """``random.Random`` wrapper. Seed and helpers shaped for fuzz reuse.""" + + __slots__ = ('_impl', 'seed') + + def __init__(self, seed): + self.seed = seed & ((1 << 64) - 1) + self._impl = random.Random(self.seed) + + def next_int(self, bound): + if bound <= 0: + raise ValueError('bound must be positive') + return self._impl.randrange(bound) + + def next_bool(self): + return self._impl.getrandbits(1) == 1 + + def next_long(self): + return self._impl.getrandbits(64) + + def choice(self, seq): + return self._impl.choice(seq) + + def shuffle(self, seq): + self._impl.shuffle(seq) + + def sample(self, seq, k): + return self._impl.sample(list(seq), k) + + def uniform(self, lo, hi): + return self._impl.uniform(lo, hi) + + def chance(self, prob): + return self._impl.random() < prob + + +# --------------------------------------------------------------------------- +# Helpers for column data. +# --------------------------------------------------------------------------- + + +def _build_test_alphabet(): + """Multi-script alphabet for stressing UTF-8 handling in varchar / + categorical columns. Restricted to letter-ish ranges so values don't + collide with QWP wire-format reserved bytes or break the server's + UTF-8 validator on round-trip.""" + ranges = [ + (0x0041, 0x005A), # A-Z + (0x0061, 0x007A), # a-z + (0x0030, 0x0039), # 0-9 + (0x00C0, 0x00FF), # Latin-1 supplement letters + (0x0100, 0x017F), # Latin Extended-A + (0x0370, 0x03FF), # Greek + (0x0400, 0x04FF), # Cyrillic + ] + return [chr(cp) for r in ranges for cp in range(r[0], r[1] + 1)] + + +_TEST_ALPHABET = _build_test_alphabet() +_ASCII_LETTERS = [chr(c) for c in range(ord('A'), ord('Z') + 1)] + + +def _random_strings(rng, n, max_len, null_prob, *, + ascii_only=False, empty_prob=0.05): + """Generate n strings, possibly with nulls and zero-length values. + + ``ascii_only`` forces the ASCII-letter subset. 
``empty_prob`` is the + chance of emitting ``''`` for a non-null slot — empty strings + exercise the zero-length offset slice in the varchar wire path.""" + pool = _ASCII_LETTERS if ascii_only else _TEST_ALPHABET + out = [] + for _ in range(n): + if null_prob > 0 and rng.chance(null_prob): + out.append(None) + continue + if empty_prob > 0 and rng.chance(empty_prob): + out.append('') + continue + length = max(1, rng.next_int(max_len)) + out.append(''.join(rng.choice(pool) for _ in range(length))) + return out + + +def _datetime_array(n, unit='ns'): + base = np.datetime64('2024-01-01T00:00:00', unit).astype('int64') + step = {'ns': 1_000_000_000, 'us': 1_000_000}[unit] + return (base + step * np.arange(n, dtype=np.int64)).astype( + f'datetime64[{unit}]') + + +# --------------------------------------------------------------------------- +# Supported field generators. Each returns a pd.Series with n_rows rows. +# --------------------------------------------------------------------------- + + +_INT64_MIN = -(1 << 63) +_INT64_MAX = (1 << 63) - 1 +_INT64_SPECIALS = ( + 0, 1, -1, + _INT64_MIN, _INT64_MIN + 1, + _INT64_MAX, _INT64_MAX - 1) + +_FLOAT64_SPECIALS = ( + 0.0, -0.0, 1.0, -1.0, + float('nan'), float('inf'), float('-inf'), + 1e-300, 1e300) + + +def _gen_int64(rng, n): + # 5% special values to exercise wire-edge cases: INT64_MIN + # (QuestDB's NULL sentinel for LONG — should still flow through + # the wire), INT64_MAX, zero, etc. + out = np.empty(n, dtype=np.int64) + for i in range(n): + if rng.chance(0.05): + out[i] = rng.choice(_INT64_SPECIALS) + else: + out[i] = int(rng.uniform(-(1 << 50), 1 << 50)) + return pd.Series(out) + + +def _gen_float64(rng, n): + # 5% IEEE-754 special values: NaN, ±Inf, ±0.0, plus the extreme + # (but still normal) magnitudes 1e-300 and 1e300. None of + # these should crash the wire encoder; the server may reject them + # semantically but the QwpAckServer doesn't validate value content.
+ out = np.empty(n, dtype=np.float64) + for i in range(n): + if rng.chance(0.05): + out[i] = rng.choice(_FLOAT64_SPECIALS) + else: + out[i] = rng.uniform(-1e6, 1e6) + return pd.Series(out) + + +def _gen_dt64ns_field(rng, n): + return pd.Series(_datetime_array(n, 'ns')) + + +def _gen_dt64us_field(rng, n): + return pd.Series(_datetime_array(n, 'us')) + + +def _gen_categorical(rng, n): + # Force a string-typed categories index. With the default constructor, + # an all-None categorical infers float64 categories — which the row + # path's argument resolver rejects with "Expected a category of + # strings". That rejection happens upstream of the columnar planner, + # so we'd see neither the v1 reject nor the v1 accept paths. + if n == 0: + return pd.Series(pd.Categorical( + [], categories=pd.Index([], dtype=object))) + cardinality = max(2, min(n, rng.next_int(16) + 2)) + # Categories must be unique. Oversample then dedup; fall back to a + # deterministic pad if the alphabet collisions leave us short. + raw_pool = _random_strings(rng, cardinality * 2, 8, 0.0) + pool = list(dict.fromkeys(raw_pool))[:cardinality] + while len(pool) < 2: + pool.append(f'_pad_{len(pool)}') + null_prob = 0.2 if rng.next_bool() else 0.0 + choices = [ + None if rng.chance(null_prob) else pool[rng.next_int(len(pool))] + for _ in range(n)] + return pd.Series(pd.Categorical( + choices, dtype=pd.CategoricalDtype(categories=pool))) + + +def _gen_string_pyarrow(rng, n): + null_prob = 0.2 if rng.next_bool() else 0.0 + items = _random_strings(rng, n, 16, null_prob) + return pd.Series(items, dtype='string[pyarrow]') + + +def _gen_large_string(rng, n): + null_prob = 0.2 if rng.next_bool() else 0.0 + items = _random_strings(rng, n, 8, null_prob) + arr = pa.array(items, type=pa.large_string()) + return pd.Series(arr, dtype=pd.ArrowDtype(pa.large_string())) + + +# (kind, generator, weight). 
Weights bias toward the variable-width and +# nullable types (categorical, string varieties) because those exercise +# more emitter code paths than fixed-width numerics. +SUPPORTED_FIELD_GENS_WEIGHTED = [ + ('int64', _gen_int64, 10), + ('float64', _gen_float64, 10), + ('dt64ns_field', _gen_dt64ns_field, 8), + ('dt64us_field', _gen_dt64us_field, 8), + ('categorical', _gen_categorical, 18), + ('string_pyarrow', _gen_string_pyarrow, 18), + ('large_string', _gen_large_string, 12), + # object-dtype str is appended below, once `_gen_object_str` is defined. +] + + +# --------------------------------------------------------------------------- +# Formerly unsupported field generators. Steps 3 and 4 made these dtypes +# supported; they are appended to SUPPORTED_FIELD_GENS_WEIGHTED below. +# --------------------------------------------------------------------------- + + +def _gen_int32(rng, n): + return pd.Series(np.array( + [rng.next_int(1 << 30) for _ in range(n)], dtype=np.int32)) + + +def _gen_float32(rng, n): + return pd.Series(np.array( + [rng.uniform(-1e6, 1e6) for _ in range(n)], dtype=np.float32)) + + +def _gen_bool(rng, n): + return pd.Series(np.array( + [rng.next_bool() for _ in range(n)], dtype=bool)) + + +def _gen_uint8(rng, n): + return pd.Series(np.array( + [rng.next_int(256) for _ in range(n)], dtype=np.uint8)) + + +def _gen_uint64(rng, n): + return pd.Series(np.array( + [rng.next_int(1 << 32) for _ in range(n)], dtype=np.uint64)) + + +def _gen_object_str(rng, n): + items = _random_strings(rng, n, 8, 0.0) + return pd.Series(items, dtype='object') + + +def _gen_string_python(rng, n): + # Force python-backed storage. With pyarrow installed, modern pandas + # defaults `dtype='string'` to pyarrow storage, which IS supported by + # the columnar planner — so the generator name would be misleading + # without the explicit `storage='python'` here.
+ items = _random_strings(rng, n, 8, 0.0) + return pd.Series(items, dtype=pd.StringDtype(storage='python')) + + +UNSUPPORTED_FIELD_GENS = [ + # Step 3 moved every narrower-dtype generator into the supported + # list below. The only remaining intentional rejection cases live + # in dedicated focused tests (e.g. table_name_col, NaT designated + # ts, non-column at), not in the fuzz frame builder. +] + + +# Step 3 added a Rust-side widening / packing appender. Every narrower +# NumPy numeric dtype + native bool is now supported via +# column_sender_chunk_append_numpy_column. +SUPPORTED_FIELD_GENS_WEIGHTED.append(('int32', _gen_int32, 8)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('float32', _gen_float32, 8)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('bool_numpy', _gen_bool, 6)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('uint8', _gen_uint8, 6)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('uint64', _gen_uint64, 6)) + + +# Object-dtype int / float / bool generators. + +def _gen_object_int(rng, n): + items = [int(rng.uniform(-(1 << 30), 1 << 30)) for _ in range(n)] + return pd.Series(items, dtype='object') + + +def _gen_object_float(rng, n): + items = [rng.uniform(-1e6, 1e6) for _ in range(n)] + return pd.Series(items, dtype='object') + + +def _gen_object_bool(rng, n): + items = [bool(rng.next_bool()) for _ in range(n)] + return pd.Series(items, dtype='object') + + +# Step 4 added PyObject support via the sniff+build path. Both +# object-dtype str and pd.StringDtype(storage='python') resolve to +# col_source_str_pyobj; int / float / bool flow through their own +# pyobj sources. 
+SUPPORTED_FIELD_GENS_WEIGHTED.append(('object_str', _gen_object_str, 10)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('string_python', _gen_string_python, 6)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('object_int', _gen_object_int, 8)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('object_float', _gen_object_float, 8)) +SUPPORTED_FIELD_GENS_WEIGHTED.append(('object_bool', _gen_object_bool, 6)) + + +# --------------------------------------------------------------------------- +# Designated-timestamp generators. +# --------------------------------------------------------------------------- + + +def _gen_at_dt64ns(rng, n): + return pd.Series(_datetime_array(n, 'ns')), True + + +def _gen_at_dt64us(rng, n): + return pd.Series(_datetime_array(n, 'us')), True + + +def _gen_at_dt64ns_nat(rng, n): + if n == 0: + return pd.Series(_datetime_array(0, 'ns')), True + s = pd.Series(_datetime_array(n, 'ns')).copy() + n_nat = max(1, n // 8) + idx = rng.sample(range(n), min(n_nat, n)) + s.iloc[idx] = pd.NaT + return s, False + + +def _gen_at_dt64ns_negative(rng, n): + if n == 0: + return pd.Series(_datetime_array(0, 'ns')), True + base = (-1_000_000_000) * np.arange(1, n + 1, dtype=np.int64) + return pd.Series(base.astype('datetime64[ns]')), False + + +# (generator, weight). Heavy bias toward the happy-path units so the +# fuzz mostly drives the supported flow; the planner-rejection variants +# (NaT, negative timestamp) are kept rare to leave room for column-side +# rejection cases to be the more interesting axis. For targeted +# reproduction, set QDB_CLIENT_FUZZ_ITER_SEED rather than tweaking weights.
+AT_GENS_WEIGHTED = [ + (_gen_at_dt64ns, 70), + (_gen_at_dt64us, 20), + (_gen_at_dt64ns_nat, 5), + (_gen_at_dt64ns_negative, 5), +] + + +def _weighted_pick_value(rng, weighted_seq): + """Pick an item from ``[(value, weight), ...]``.""" + total = sum(w for _, w in weighted_seq) + pick = rng.next_int(total) + accum = 0 + for item, w in weighted_seq: + accum += w + if pick < accum: + return item + return weighted_seq[-1][0] + + +def _weighted_pick_kv(rng, weighted_triples): + """Pick an item from ``[(key, value, weight), ...]``.""" + total = sum(t[-1] for t in weighted_triples) + pick = rng.next_int(total) + accum = 0 + for triple in weighted_triples: + accum += triple[-1] + if pick < accum: + return triple[0], triple[1] + return weighted_triples[-1][0], weighted_triples[-1][1] + + +# Row counts deliberately chosen to hit chunk-boundary edges: +# - 0 empty df no-op +# - 1, 7 < the 8-row validity alignment floor +# - 8, 16 exact multiples of 8 +# - 9, 17 multiple-of-8 + 1 -> tail chunk +# - others a few larger sizes +ROW_COUNT_CHOICES = [0, 1, 2, 7, 8, 9, 15, 16, 17, 32, 63, 64, 100, 257] + + +# Symbols-argument variants picked per iteration. Kept as named modes +# so the categorical-routing constraints stay obvious: +# - 'auto' : every categorical is a symbol. +# - False : no symbols; `_build_frame` generates no categorical +# columns in this mode. +# - 'all' : explicit list of every categorical (equivalent to auto +# but exercises the list-symbols code path). +# - 'partial': drop one categorical from the symbol list when there +# are at least two cats present; the unlisted cat falls +# through to a plain VARCHAR field and the frame stays +# supported. +SYMBOL_MODES_WEIGHTED = [ + ('auto', 6), + (False, 3), + ('all', 3), + ('partial', 2), +] + + +def _build_frame(rng): + """ + Return (df, kwargs, expected_supported). + + ``expected_supported`` describes the static v1 planner's accept/reject + decision.
If True and ``len(df) == 0``, ``Client.dataframe()`` returns + early without sending; otherwise an accepted frame produces at least + one binary frame on the wire. + + Column generation and the ``symbols`` argument are kept consistent + so ``expected_supported`` actually reflects the planner's rules + (categoricals route through the symbol path only when 'auto' or + explicitly listed; an unlisted categorical falls through to a plain + VARCHAR field, so the frame stays supported). + """ + n_rows = rng.choice(ROW_COUNT_CHOICES) + + at_gen = _weighted_pick_value(rng, AT_GENS_WEIGHTED) + ts, at_ok = at_gen(rng, n_rows) + expected_supported = at_ok + + # Decide the symbols mode up front so we know whether to allow + # categorical columns at all. + sym_mode = _weighted_pick_value(rng, SYMBOL_MODES_WEIGHTED) + allow_categorical = sym_mode is not False + + cols = {'ts': ts} + + # Step 3 emptied UNSUPPORTED_FIELD_GENS — every previously-rejected + # narrow NumPy dtype is now accepted via the widening appender, and + # PyObject sources via the sniff+build path. The unsupported-column + # injection is preserved here for forward use if a new + # never-accepted dtype shows up in the future. + if UNSUPPORTED_FIELD_GENS and rng.chance(0.25): + kind, gen = _weighted_pick_kv( + rng, [(k, g, 1) for k, g in UNSUPPORTED_FIELD_GENS]) + cols[f'bad_{kind}'] = gen(rng, n_rows) + if n_rows > 0: + expected_supported = False + + gen_pool = SUPPORTED_FIELD_GENS_WEIGHTED + if not allow_categorical: + gen_pool = [(k, g, w) for k, g, w in SUPPORTED_FIELD_GENS_WEIGHTED + if k != 'categorical'] + + n_field_cols = rng.next_int(4) + 1 + cat_col_names = [] + for c in range(n_field_cols): + kind, gen = _weighted_pick_kv(rng, gen_pool) + name = f'c{c}_{kind}' + cols[name] = gen(rng, n_rows) + if kind == 'categorical': + cat_col_names.append(name) + + df = pd.DataFrame(cols) + + # Column order shouldn't affect correctness; randomise to flush out + # planner ordering bugs.
+ if rng.next_bool(): + order = list(df.columns) + rng.shuffle(order) + df = df[order] + + # Resolve the symbols mode into a concrete argument now that we know + # which categoricals exist. + if sym_mode == 'auto': + symbols = 'auto' + elif sym_mode is False: + symbols = False + elif sym_mode == 'all': + symbols = cat_col_names if cat_col_names else 'auto' + elif sym_mode == 'partial': + if len(cat_col_names) >= 2: + listed = list(cat_col_names) + rng.shuffle(listed) + symbols = listed[:-1] + # An unlisted categorical is not rejected: like the row/numpy + # path, it falls through to a plain VARCHAR field, so the frame + # stays supported. + else: + # No second categorical to drop -> degenerate; equivalent to + # listing all (or 'auto' when none exist). + symbols = cat_col_names if cat_col_names else 'auto' + else: + raise RuntimeError(f'unknown sym_mode={sym_mode!r}') + + kwargs = {'table_name': 'fuzz_table', 'at': 'ts', 'symbols': symbols} + return df, kwargs, expected_supported + + +# --------------------------------------------------------------------------- +# Tests. +# --------------------------------------------------------------------------- + + +@unittest.skipIf(pd is None or pa is None, 'pandas/pyarrow not installed') +class TestClientDataframeFuzz(unittest.TestCase): + """Round-trip fuzz: every iteration goes through Client.dataframe() to + a local QwpAckServer.""" + + DEFAULT_ITERS = 100 + + @classmethod + def setUpClass(cls): + cls.iter_seed_override = _parse_int_env(ITER_SEED_ENV) + if cls.iter_seed_override is not None: + # In override mode the master seed never feeds anything; + # report only the iter seed so the log isn't misleading. 
+ cls.master_seed = None + cls.iters = 1 + sys.stderr.write( + f'>>>> Client.dataframe fuzz: ' + f'iter_seed_override={_format_seed(cls.iter_seed_override)}, ' + f'iters=1\n') + return + cls.master_seed = _derive_master_seed() + cls.iters = _parse_int_env(ITERS_ENV) or cls.DEFAULT_ITERS + sys.stderr.write( + f'>>>> Client.dataframe fuzz: master_seed=' + f'{_format_seed(cls.master_seed)}, iters={cls.iters}\n') + + def setUp(self): + self.server = QwpAckServer() + self.server.start() + self.conf = ( + f'qwpws::addr=127.0.0.1:{self.server.port};' + 'pool_size=1;pool_max=1;pool_reap=manual;') + + def tearDown(self): + self.server.stop() + + def _seed_msg(self, iter_seed): + if self.master_seed is None: + return f'iter={_format_seed(iter_seed)}' + return ( + f'master={_format_seed(self.master_seed)}, ' + f'iter={_format_seed(iter_seed)}') + + def _check_one(self, client, df, kwargs, expected_supported, + iter_seed, prev_binary_frames): + """Run one iteration. Returns the new ``binary_frames`` count so + the loop can advance ``prev`` without an extra snapshot.""" + try: + client.dataframe(df, **kwargs) + except qi.UnsupportedDataFrameShapeError as exc: + self.assertEqual( + exc.code, qi.IngressErrorCode.BadDataFrame, + f'UnsupportedDataFrameShapeError did not carry ' + f'BadDataFrame code; {self._seed_msg(iter_seed)}') + self.assertFalse( + expected_supported, + f'Client rejected an expected-supported frame; ' + f'{self._seed_msg(iter_seed)}: {exc}') + cur = self.server.snapshot()['binary_frames'] + self.assertEqual( + cur, prev_binary_frames, + f'rejection published a binary frame; ' + f'{self._seed_msg(iter_seed)}') + return cur + except qi.IngressError as exc: + self.assertFalse( + expected_supported, + f'Client raised IngressError for an expected-supported frame; ' + f'{self._seed_msg(iter_seed)}: {exc.code}: {exc}') + cur = self.server.snapshot()['binary_frames'] + self.assertEqual( + cur, prev_binary_frames, + f'IngressError rejection published a binary frame; ' + 
f'{self._seed_msg(iter_seed)}: {exc.code}: {exc}') + return cur + # Accept path. + self.assertTrue( + expected_supported, + f'Client accepted an expected-rejected frame; ' + f'{self._seed_msg(iter_seed)}') + cur = self.server.snapshot()['binary_frames'] + if len(df) == 0: + self.assertEqual( + cur, prev_binary_frames, + f'empty df published a binary frame; ' + f'{self._seed_msg(iter_seed)}') + else: + self.assertGreater( + cur, prev_binary_frames, + f'accepted non-empty df published no binary frame; ' + f'{self._seed_msg(iter_seed)}') + return cur + + def _iter_seeds(self): + if self.iter_seed_override is not None: + return [self.iter_seed_override] + master = Rng(self.master_seed) + return [master.next_long() for _ in range(self.iters)] + + def _master_label(self): + if self.master_seed is None: + return f'iter_seed_override={_format_seed(self.iter_seed_override)}' + return f'master_seed={_format_seed(self.master_seed)}' + + def test_fuzz_round_trip(self): + seeds = self._iter_seeds() + client = qi.Client.from_conf(self.conf) + failures = [] + try: + prev = 0 + for iter_seed in seeds: + rng = Rng(iter_seed) + try: + df, kwargs, expected_supported = _build_frame(rng) + prev = self._check_one( + client, df, kwargs, expected_supported, + iter_seed, prev) + except AssertionError as exc: + failures.append((iter_seed, type(exc).__name__, str(exc))) + prev = self.server.snapshot()['binary_frames'] + except Exception as exc: # noqa: BLE001 — fuzz triage + failures.append(( + iter_seed, type(exc).__name__, repr(exc))) + prev = self.server.snapshot()['binary_frames'] + finally: + client.close() + + stats = self.server.snapshot() + self.assertEqual( + stats['errors'], [], + f'server saw protocol errors: {stats["errors"]}; ' + f'{self._master_label()}') + self.assertEqual( + stats['accepted_connections'], 1, + f'expected 1 TCP accept across {len(seeds)} iterations, ' + f'saw {stats["accepted_connections"]}; ' + f'{self._master_label()}') + + if failures: + preview = 
'\n'.join( + f' iter={_format_seed(s)} [{cls}]: {m}' + for s, cls, m in failures[:5]) + self.fail( + f'{len(failures)}/{len(seeds)} iterations failed.\n' + f'{self._master_label()}\n' + f'(showing first 5)\n{preview}') + + # ------- Focused property tests below. Reliable, non-fuzz. ------- + + def test_rejects_non_column_at_arguments(self): + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(2)), + 'seq': pd.Series([1, 2], dtype='int64'), + }) + client = qi.Client.from_conf(self.conf) + try: + for at_val in ( + qi.ServerTimestamp, + qi.TimestampNanos(1_700_000_000_000_000_000), + datetime.datetime(2024, 1, 1)): + with self.assertRaises( + qi.UnsupportedDataFrameShapeError, + msg=f'at={at_val!r} should be rejected'): + client.dataframe(df, table_name='t', at=at_val) + finally: + client.close() + self.assertEqual(self.server.snapshot()['binary_frames'], 0) + + def test_rejects_table_name_col(self): + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(2)), + 'tbl': pd.Series(['a', 'b'], dtype='string[pyarrow]'), + 'seq': pd.Series([1, 2], dtype='int64'), + }) + client = qi.Client.from_conf(self.conf) + try: + with self.assertRaises(qi.UnsupportedDataFrameShapeError): + client.dataframe(df, table_name_col='tbl', at='ts') + finally: + client.close() + self.assertEqual(self.server.snapshot()['binary_frames'], 0) + + def test_closed_client_methods_reject(self): + client = qi.Client.from_conf(self.conf) + client.close() + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(1)), + 'seq': pd.Series([1], dtype='int64'), + }) + + def _call_dataframe(c): + c.dataframe(df, table_name='t', at='ts') + + def _call_reap(c): + c.reap_idle() + + def _call_enter(c): + c.__enter__() + + for op in (_call_dataframe, _call_reap, _call_enter): + with self.assertRaises(qi.IngressError) as cm: + op(client) + self.assertEqual( + cm.exception.code, qi.IngressErrorCode.InvalidApiCall, + f'{op.__name__} on closed client should raise InvalidApiCall') + + # close() must remain idempotent 
on a closed client. + client.close() + client.close() + + def test_multi_chunk_emission(self): + """Force ``len(df)`` above the planner's per-chunk row cap so the + chunk-split loop, deferred-flush path, and final sync are all + exercised. The Arrow-string planner cap is 32 000, so 32 001 + rows guarantees two chunks (32 000 + 1).""" + n_rows = 32_001 + rng = Rng(0xc4_c0_de_b1_05_1d_75_3d) # deterministic, arbitrary + items = _random_strings(rng, n_rows, 8, 0.0) + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(n_rows, 'ns')), + 's': pd.Series(items, dtype='string[pyarrow]'), + 'seq': pd.Series(np.arange(n_rows, dtype=np.int64)), + }) + client = qi.Client.from_conf(self.conf) + try: + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + client.dataframe(df, table_name='multi_chunk', at='ts', + symbols=False) + finally: + io_stats = qi._debug_dataframe_columnar_io_stats( + enabled=False) + finally: + client.close() + self.assertGreaterEqual( + io_stats['flush_calls'], 2, + f'multi-chunk emission expected >=2 flushes; ' + f'got io_stats={io_stats}') + self.assertEqual( + io_stats['sync_calls'], 1, + f'expected exactly one sync per Client.dataframe() call; ' + f'got io_stats={io_stats}') + stats = self.server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['binary_frames'], 2) + + def test_multi_chunk_with_nulls(self): + """Force multi-chunk emission with a nullable categorical so the + validity bitmap must be sliced across chunk boundaries. + + The categorical-symbols planner cap is 100 000 and the planner + rounds chunk size to a multiple of 8 when validity is present. 
+ Using > 100 000 rows guarantees at least two chunks; randomly + sprinkled nulls verify ``(arr.buffers[0]) + (row_offset + // 8)`` lands on the correct byte for the second chunk.""" + n_rows = 100_003 # > 100k cap + force a 3-row tail chunk + rng = Rng(0xa17e_4c91_55_42_99_03) + sym_pool = [f'S{i:04d}' for i in range(64)] + choices = [ + None if rng.chance(0.15) else sym_pool[rng.next_int(64)] + for _ in range(n_rows)] + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(n_rows, 'ns')), + 'sym': pd.Series(pd.Categorical( + choices, dtype=pd.CategoricalDtype(categories=sym_pool))), + 'seq': pd.Series(np.arange(n_rows, dtype=np.int64)), + }) + client = qi.Client.from_conf(self.conf) + try: + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + client.dataframe(df, table_name='mc_nulls', at='ts') + finally: + io_stats = qi._debug_dataframe_columnar_io_stats( + enabled=False) + finally: + client.close() + self.assertGreaterEqual( + io_stats['flush_calls'], 2, + f'expected >=2 flushes; io_stats={io_stats}') + self.assertEqual(io_stats['sync_calls'], 1) + stats = self.server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['binary_frames'], 2) + + def test_high_cardinality_symbol_i16(self): + """A categorical with > 128 categories forces the i16-codes + path. The default rng-driven fuzz almost never produces enough + cardinality to reach this branch.""" + n_rows = 1_000 + cardinality = 200 # > 128 -> i16 + rng = Rng(0xc1d_7e_4f_55_42_de_ad) + pool = [f'C{i:04d}_{chr(0x0391 + (i % 24))}' for i in range(cardinality)] + choices = [pool[rng.next_int(cardinality)] for _ in range(n_rows)] + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(n_rows, 'ns')), + 'sym': pd.Series(pd.Categorical( + choices, dtype=pd.CategoricalDtype(categories=pool))), + 'seq': pd.Series(np.arange(n_rows, dtype=np.int64)), + }) + # Sanity: pandas should have picked an int16 code width. 
+ self.assertEqual( + df['sym'].cat.codes.dtype, np.int16, + 'expected i16 code width for cardinality > 128') + client = qi.Client.from_conf(self.conf) + try: + client.dataframe(df, table_name='hi_card_sym', at='ts') + finally: + client.close() + stats = self.server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['binary_frames'], 1) + + def test_wide_frame_multi_chunk(self): + """A frame with > 8 field columns hits the planner's + ``rows_per_chunk = 64_000`` branch. Using 64 001 rows guarantees + chunk-split through the wide-frame path (distinct from the + Arrow-string and categorical-symbols caps exercised elsewhere).""" + n_rows = 64_001 + n_int_cols = 12 + df_cols = {'ts': pd.Series(_datetime_array(n_rows, 'ns'))} + seq = np.arange(n_rows, dtype=np.int64) + for i in range(n_int_cols): + df_cols[f'i{i:02d}'] = pd.Series(seq + i * 1_000_000) + df = pd.DataFrame(df_cols) + client = qi.Client.from_conf(self.conf) + try: + qi._debug_dataframe_columnar_io_stats(enabled=True, reset=True) + try: + client.dataframe(df, table_name='wide', at='ts', + symbols=False) + finally: + io_stats = qi._debug_dataframe_columnar_io_stats( + enabled=False) + finally: + client.close() + self.assertGreaterEqual( + io_stats['flush_calls'], 2, + f'expected >=2 flushes for wide-frame multi-chunk; ' + f'io_stats={io_stats}') + self.assertEqual(io_stats['sync_calls'], 1) + stats = self.server.snapshot() + self.assertEqual(stats['errors'], []) + self.assertGreaterEqual(stats['binary_frames'], 2) + + def test_sequential_client_lifecycle(self): + """Open, use, and close a fresh Client many times in succession. 
+ Each cycle opens a new TCP connection (because the prior Client + was closed); we verify that lifecycle is clean across repeated + open/close cycles, no leaks, no server-side protocol errors.""" + n_cycles = 30 + rng = Rng(0x115ec_2_f0a_55_42) + df = pd.DataFrame({ + 'ts': pd.Series(_datetime_array(8, 'ns')), + 'seq': pd.Series(np.arange(8, dtype=np.int64)), + 's': pd.Series(_random_strings(rng, 8, 8, 0.0), + dtype='string[pyarrow]'), + }) + for _ in range(n_cycles): + client = qi.Client.from_conf(self.conf) + try: + client.dataframe(df, table_name='seq_lifecycle', at='ts', + symbols=False) + finally: + client.close() + stats = self.server.snapshot() + self.assertEqual( + stats['errors'], [], + f'server saw protocol errors across {n_cycles} cycles: ' + f'{stats["errors"]}') + self.assertEqual( + stats['accepted_connections'], n_cycles, + f'expected {n_cycles} accepts, saw ' + f'{stats["accepted_connections"]}') + self.assertGreaterEqual(stats['binary_frames'], n_cycles) + + def test_empty_dataframe_is_noop(self): + df = pd.DataFrame({ + 'ts': pd.Series([], dtype='datetime64[ns]'), + 'seq': pd.Series([], dtype='int64'), + }) + client = qi.Client.from_conf(self.conf) + try: + client.dataframe(df, table_name='t', at='ts') + finally: + client.close() + stats = self.server.snapshot() + self.assertEqual(stats['binary_frames'], 0) + self.assertEqual(stats['errors'], []) + + def test_from_conf_rejects_non_qwp_websocket(self): + with self.assertRaises(qi.IngressError) as cm: + qi.Client.from_conf('tcp::addr=localhost:9009;') + self.assertEqual(cm.exception.code, qi.IngressErrorCode.ConfigError) + + def test_from_conf_requires_addr(self): + with self.assertRaises(qi.IngressError) as cm: + qi.Client.from_conf('qwpws::pool_size=1;') + self.assertEqual(cm.exception.code, qi.IngressErrorCode.ConfigError) + + +@unittest.skipIf(pd is None or pa is None, 'pandas/pyarrow not installed') +class TestClientDataframeSfaFuzz(TestClientDataframeFuzz): + """Same pandas fuzz/property 
suite, through the columnar SFA backend.""" + + DEFAULT_ITERS = 25 + + def setUp(self): + self.server = QwpAckServer() + self.server.start() + self._sf_tmp = tempfile.TemporaryDirectory( + prefix='client-df-sfa-fuzz-') + self.sender_id = 'py-df-fuzz-' + uuid.uuid4().hex[:8] + self.conf = _sfa_conf( + self.server.port, + self.sender_id, + self._sf_tmp.name) + + def tearDown(self): + try: + self.assertEqual( + _sfa_file_count(self._sf_tmp.name, self.sender_id), + 0, + f'SFA files left after dataframe test; ' + f'{self._master_label()}') + finally: + self.server.stop() + self._sf_tmp.cleanup() + + +# --------------------------------------------------------------------------- +# Round-trip fuzz against a real QuestDB. Gated on QDB_REPO_PATH, matching +# system_test.py's convention. +# --------------------------------------------------------------------------- + + +def _normalize_for_compare(df): + """Project a DataFrame onto a representation that compares cleanly + across the QuestDB round-trip. + + Does not drop columns: the caller projects out the QuestDB-renamed + designated timestamp and compares it separately if needed. Coerces + categorical and any string-flavoured dtype to plain `object` strings. + Sorts columns alphabetically. + """ + df = df.copy() + df = df.reindex(sorted(df.columns), axis=1) + out = {} + for col in df.columns: + s = df[col] + if (isinstance(s.dtype, pd.CategoricalDtype) + or pd.api.types.is_string_dtype(s.dtype)): + out[col] = s.astype('object') + elif pd.api.types.is_datetime64_any_dtype(s.dtype): + # Strip timezone (QuestDB always returns UTC; source may be + # tz-naive) and normalise to microsecond resolution to match + # QuestDB's TIMESTAMP precision on round-trip. + v = s.dt.tz_convert(None) if s.dt.tz is not None else s + out[col] = v.astype('datetime64[us]') + else: + out[col] = s + return pd.DataFrame(out) + + +@unittest.skipUnless( + os.environ.get('QDB_REPO_PATH') and pd is not None and pa is not None, + 'Round-trip fuzz needs a real QuestDB.
Set QDB_REPO_PATH= ' + 'to enable. Matches the gating convention in system_test.py.') +class TestClientDataframeRoundTrip(unittest.TestCase): + """Ingest via Client.dataframe → real QuestDB → read back via + Client.query → assert frame equivalence. + + Set ``QDB_REPO_PATH=/path/to/questdb`` to enable. Uses a class-scoped + QuestDB fixture (one process; tables are dropped between iterations). + """ + + DEFAULT_ITERS = 8 + + @classmethod + def setUpClass(cls): + # Import the heavy fixture infra only when this test class runs. + import importlib + cls._fixture_mod = importlib.import_module('fixture') + repo = os.environ.get('QDB_REPO_PATH') + if not repo: + raise unittest.SkipTest( + 'QDB_REPO_PATH required for Layer-3 fuzz') + install_path = cls._fixture_mod.install_questdb_from_repo( + __import__('pathlib').Path(repo)) + import shutil + plain_dir = PROJ_ROOT / 'build' / 'questdb' / 'layer3' + plain_dir.mkdir(parents=True, exist_ok=True) + shutil.copytree(install_path, plain_dir, dirs_exist_ok=True) + cls.qdb = cls._fixture_mod.QuestDbFixture( + plain_dir, auth=False, http=True) + cls.qdb.start() + + cls.iter_seed_override = _parse_int_env(ITER_SEED_ENV) + if cls.iter_seed_override is not None: + cls.master_seed = None + cls.iters = 1 + else: + cls.master_seed = _derive_master_seed() + cls.iters = _parse_int_env(ITERS_ENV) or cls.DEFAULT_ITERS + sys.stderr.write( + f'>>>> Round-trip fuzz vs real QuestDB: ' + f'master={_format_seed(cls.master_seed) if cls.master_seed is not None else "n/a"}, ' + f'iter_override=' + f'{_format_seed(cls.iter_seed_override) if cls.iter_seed_override is not None else "n/a"}, ' + f'iters={cls.iters}\n') + + @classmethod + def tearDownClass(cls): + if getattr(cls, 'qdb', None) is not None: + cls.qdb.stop() + + @property + def conf(self): + return (f'qwpws::addr={self.qdb.host}:' + f'{self.qdb.http_server_port};') + + def _wait_for_rows(self, table_name, expected, timeout_s=30): + deadline = time.monotonic() + timeout_s + while time.monotonic() < deadline: + try: +
res = self.qdb.http_sql_query( + f'SELECT count() FROM {table_name}') + except Exception: + time.sleep(0.1) + continue + rows = res.get('dataset') or [] + if rows and rows[0][0] >= expected: + return + time.sleep(0.1) + raise RuntimeError( + f'WAL apply timed out: {expected} rows expected on {table_name}') + + def _drop_table(self, table_name): + try: + self.qdb.http_sql_query(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + # Round-trip generators avoid QuestDB's sentinel-value collisions: + # INT64_MIN aliases LONG null, NaN aliases DOUBLE null. The fuzz + # generators in this module deliberately sprinkle those values to + # exercise the wire encoder; for the Layer-3 round-trip oracle + # we need lossless inputs. + @staticmethod + def _gen_int64_safe(rng, n): + out = np.empty(n, dtype=np.int64) + for i in range(n): + out[i] = int(rng.uniform(-(1 << 50), 1 << 50)) + return pd.Series(out) + + @staticmethod + def _gen_float64_safe(rng, n): + out = np.empty(n, dtype=np.float64) + for i in range(n): + out[i] = rng.uniform(-1e6, 1e6) + return pd.Series(out) + + def _build_simple_frame(self, rng): + """Hand-picked frame shapes for round-trip. 
Each is a + type-coverage probe rather than a max-entropy fuzz; this + keeps normalisation tractable for first-cut Layer-3.""" + n_rows = max(rng.choice(ROW_COUNT_CHOICES), 1) + cols = { + 'ts': pd.Series(_datetime_array(n_rows, 'ns')), + 'id': pd.Series(np.arange(1, n_rows + 1, dtype=np.int64)), + } + shape = rng.choice(['numeric', 'string', 'categorical', 'mixed']) + if shape in ('numeric', 'mixed'): + cols['price'] = self._gen_float64_safe(rng, n_rows) + cols['count'] = self._gen_int64_safe(rng, n_rows) + if shape in ('string', 'mixed'): + cols['note'] = pd.Series( + _random_strings(rng, n_rows, 8, 0.0, ascii_only=True), + dtype='string[pyarrow]') + if shape in ('categorical', 'mixed'): + cols['sym'] = _gen_categorical(rng, n_rows) + return pd.DataFrame(cols), shape, n_rows + + def _iter_seeds(self): + if self.iter_seed_override is not None: + return [self.iter_seed_override] + master = Rng(self.master_seed) + return [master.next_long() for _ in range(self.iters)] + + def test_round_trip(self): + seeds = self._iter_seeds() + failures = [] + for iter_idx, iter_seed in enumerate(seeds): + rng = Rng(iter_seed) + table_name = f'rt_{iter_idx}_{iter_seed:016x}' + try: + df, shape, n_rows = self._build_simple_frame(rng) + self._drop_table(table_name) + with qi.Client.from_conf(self.conf) as client: + client.dataframe(df, table_name=table_name, at='ts') + self._wait_for_rows(table_name, n_rows) + + # Read back. Project out 'ts' (renamed to 'timestamp') + # so the comparison stays tractable. 
+ cols = [c for c in df.columns if c != 'ts'] + sql = (f"SELECT {','.join(cols)} FROM {table_name} " + f"ORDER BY id") + with qi.Client.from_conf(self.conf) as client: + result = client.query(sql) + df_out = result.to_pandas() + + df_in_norm = _normalize_for_compare( + df[cols].sort_values('id').reset_index(drop=True)) + df_out_norm = _normalize_for_compare( + df_out.sort_values('id').reset_index(drop=True)) + pd.testing.assert_frame_equal( + df_in_norm, df_out_norm, + check_dtype=False, check_like=True) + except Exception as exc: + failures.append( + (iter_seed, shape if 'shape' in locals() else '?', + type(exc).__name__, repr(exc))) + # Try to drop the table to keep iterations independent. + self._drop_table(table_name) + + if failures: + preview = '\n'.join( + f' iter={_format_seed(s)} shape={sh} [{cls}]: {m}' + for s, sh, cls, m in failures[:5]) + self.fail( + f'{len(failures)}/{len(seeds)} iterations failed.\n' + f'(showing first 5)\n{preview}') + + def test_targeted_payload_semantics(self): + table_name = f'rt_payload_{uuid.uuid4().hex[:8]}' + ts_values = np.array([ + '2024-01-01T00:00:00.123456', + '2024-01-01T00:00:01.654321', + '2024-01-01T00:00:02.000000', + '2024-01-01T00:00:03.999999', + ], dtype='datetime64[us]') + large_text_values = ['alpha', 'bravo', None, 'cafe'] + dict_text_values = ['EUR', 'USD', None, 'EUR'] + df = pd.DataFrame({ + 'ts': pd.Series(ts_values), + 'seq': pd.Series([1, 2, 3, 4], dtype=np.int64), + 'large_text': pd.Series( + pa.array(large_text_values, type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())), + 'dict_text': pd.Series( + pa.array(dict_text_values, type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())).astype('category'), + }) + + try: + self._drop_table(table_name) + with qi.Client.from_conf(self.conf) as client: + client.dataframe(df, table_name=table_name, at='ts') + self._wait_for_rows(table_name, len(df)) + + with qi.Client.from_conf(self.conf) as client: + table = client.query( + f'SELECT 
timestamp, seq, large_text, dict_text ' + f'FROM {table_name} ORDER BY seq').to_arrow() + + self.assertEqual(table.num_rows, len(df)) + actual_ts = table.column('timestamp').to_pandas() + if actual_ts.dt.tz is not None: + actual_ts = actual_ts.dt.tz_convert(None) + expected_ts = pd.Series(ts_values) + pd.testing.assert_series_equal( + actual_ts.astype('datetime64[us]').reset_index(drop=True), + expected_ts.astype('datetime64[us]'), + check_names=False) + self.assertEqual( + table.column('large_text').to_pylist(), + large_text_values) + self.assertEqual( + table.column('dict_text').to_pylist(), + dict_text_values) + finally: + self._drop_table(table_name) + + def test_targeted_timestamp_units_round_trip(self): + source_values = [ + '2024-01-01T00:00:00.000000', + '2024-01-01T00:00:01.123000', + '2024-01-01T00:00:02.456000', + ] + for unit in ('s', 'ms', 'us', 'ns'): + table_name = f'rt_ts_{unit}_{uuid.uuid4().hex[:8]}' + ts_values = np.array(source_values, dtype=f'datetime64[{unit}]') + df = pd.DataFrame({ + 'ts': pd.Series(ts_values), + 'seq': pd.Series([1, 2, 3], dtype=np.int64), + }) + + try: + self._drop_table(table_name) + with qi.Client.from_conf(self.conf) as client: + client.dataframe(df, table_name=table_name, at='ts') + self._wait_for_rows(table_name, len(df)) + + with qi.Client.from_conf(self.conf) as client: + table = client.query( + f'SELECT timestamp, seq FROM {table_name} ' + f'ORDER BY seq').to_arrow() + + actual_ts = table.column('timestamp').to_pandas() + if actual_ts.dt.tz is not None: + actual_ts = actual_ts.dt.tz_convert(None) + expected_ts = pd.Series(ts_values).astype('datetime64[us]') + pd.testing.assert_series_equal( + actual_ts.astype('datetime64[us]').reset_index(drop=True), + expected_ts.reset_index(drop=True), + check_names=False) + finally: + self._drop_table(table_name) + + def test_targeted_rust_arrow_classifier_numeric_round_trip(self): + table_name = f'rt_arrow_numeric_{uuid.uuid4().hex[:8]}' + ts_type = pa.timestamp('ms', tz='UTC') 
+ df = pd.DataFrame({ + 'ts': pd.Series( + pa.array( + [1704067200000, 1704067201000, 1704067202000], + type=ts_type), + dtype=pd.ArrowDtype(ts_type)), + 'seq': pd.Series([1, 2, 3], dtype=np.int64), + 'u8': pd.Series( + pa.array([1, 2, 255], type=pa.uint8()), + dtype=pd.ArrowDtype(pa.uint8())), + 'u16': pd.Series( + pa.array([1000, 2000, 3000], type=pa.uint16()), + dtype=pd.ArrowDtype(pa.uint16())), + 'u64': pd.Series( + pa.array([1, 2 ** 31, 2 ** 40], type=pa.uint64()), + dtype=pd.ArrowDtype(pa.uint64())), + 'f16': pd.Series( + pa.array(np.array([1.5, 2.5, 3.5], dtype=np.float16), + type=pa.float16()), + dtype=pd.ArrowDtype(pa.float16())), + }) + + try: + self._drop_table(table_name) + with qi.Client.from_conf(self.conf) as client: + client.dataframe(df, table_name=table_name, at='ts') + self._wait_for_rows(table_name, len(df)) + + with qi.Client.from_conf(self.conf) as client: + table = client.query( + f'SELECT timestamp, seq, u8, u16, u64, f16 ' + f'FROM {table_name} ORDER BY seq').to_arrow() + + actual_ts = table.column('timestamp').to_pandas() + if actual_ts.dt.tz is not None: + actual_ts = actual_ts.dt.tz_convert(None) + expected_ts = pd.Series( + ['2024-01-01T00:00:00.000000', + '2024-01-01T00:00:01.000000', + '2024-01-01T00:00:02.000000'], + dtype='datetime64[us]') + pd.testing.assert_series_equal( + actual_ts.astype('datetime64[us]').reset_index(drop=True), + expected_ts.reset_index(drop=True), + check_names=False) + self.assertEqual(table.column('u8').to_pylist(), [1, 2, 255]) + self.assertEqual( + table.column('u16').to_pylist(), [1000, 2000, 3000]) + self.assertEqual( + table.column('u64').to_pylist(), [1, 2 ** 31, 2 ** 40]) + np.testing.assert_allclose( + np.array(table.column('f16').to_pylist(), dtype=np.float32), + np.array([1.5, 2.5, 3.5], dtype=np.float32)) + finally: + self._drop_table(table_name) + + +# Late imports for the round-trip class. 
+import time # noqa: E402 + + +if __name__ == '__main__': + unittest.main() diff --git a/test/test_client_polars_fuzz.py b/test/test_client_polars_fuzz.py new file mode 100644 index 00000000..1986856a --- /dev/null +++ b/test/test_client_polars_fuzz.py @@ -0,0 +1,726 @@ +#!/usr/bin/env python3 +"""Fuzz tests for polars DataFrame ingestion via ``Client.dataframe()``. + +Polars frames take the Arrow C Stream capsule path +(``__arrow_c_stream__``) and stay pyarrow-free. Every iteration builds a random +polars frame and ingests it to a local ``QwpAckServer``, asserting: + + * the client's accept / reject decision matches the static rule, and + * an accepted non-empty frame publishes at least one binary frame while + an empty frame publishes none (frames from rejections are unchecked). + +Unlike the pandas fuzz, every *field* dtype generated here (ints, uints, +floats, bool, utf8, categorical, enum, date, datetime, time, decimal, +binary, list, array) is expected to be accepted; a non-empty frame is +rejected only for a null, pre-epoch, or non-datetime ``ts``, or no fields. + +Reproduce one failing iteration with its seed: + + QDB_CLIENT_FUZZ_ITER_SEED=0x... \\ + python -m unittest test.test_client_polars_fuzz +""" +import sys + +sys.dont_write_bytecode = True + +import datetime +import decimal +import math +import os +import tempfile +import time +import unittest +import uuid + +import patch_path +patch_path.patch() + +import questdb.ingress as qi + +PROJ_ROOT = patch_path.PROJ_ROOT +sys.path.append(str(PROJ_ROOT / 'c-questdb-client' / 'system_test')) +from qwp_ws_ack_server import QwpAckServer + +# Shared seed/RNG helpers from the pandas fuzz; its pandas/pyarrow imports +# are lazy, so reusing them keeps this module pyarrow-free.
+from test_client_dataframe_fuzz import ( + Rng, + _random_strings, + _weighted_pick_value, + _weighted_pick_kv, + _parse_int_env, + _derive_master_seed, + _format_seed, + _sfa_conf, + _sfa_file_count, + ITER_SEED_ENV, + ITERS_ENV, + ROW_COUNT_CHOICES, +) + +try: + import polars as pl +except ImportError: + pl = None + + +# --------------------------------------------------------------------------- +# Field generators. Each returns a polars Series of length n (named later by +# the DataFrame dict key). All produce capsule-path-supported columns. +# --------------------------------------------------------------------------- + + +def _int_series(rng, n, dtype, lo, hi): + null_prob = 0.2 if rng.next_bool() else 0.0 + specials = (lo, hi, 0, 1, -1 if lo < 0 else 0) + span = hi - lo + 1 + out = [] + for _ in range(n): + if null_prob and rng.chance(null_prob): + out.append(None) + elif rng.chance(0.05): + out.append(rng.choice(specials)) + else: + out.append(lo + rng.next_int(span)) + return pl.Series(out, dtype=dtype) + + +def _gen_i8(rng, n): + return _int_series(rng, n, pl.Int8, -128, 127) + + +def _gen_i16(rng, n): + return _int_series(rng, n, pl.Int16, -32768, 32767) + + +def _gen_i32(rng, n): + return _int_series(rng, n, pl.Int32, -(1 << 31), (1 << 31) - 1) + + +def _gen_i64(rng, n): + return _int_series(rng, n, pl.Int64, -(1 << 50), (1 << 50)) + + +def _gen_u8(rng, n): + return _int_series(rng, n, pl.UInt8, 0, 255) + + +def _gen_u16(rng, n): + return _int_series(rng, n, pl.UInt16, 0, 65535) + + +def _gen_u32(rng, n): + return _int_series(rng, n, pl.UInt32, 0, (1 << 32) - 1) + + +def _gen_u64(rng, n): + # Keep below i64::MAX; QuestDB QWP encodes integers as signed i64. 
+ return _int_series(rng, n, pl.UInt64, 0, (1 << 62)) + + +_FLOAT_SPECIALS = ( + 0.0, -0.0, 1.0, -1.0, + float('nan'), float('inf'), float('-inf'), + 1e-300, 1e300) + + +def _float_series(rng, n, dtype): + null_prob = 0.2 if rng.next_bool() else 0.0 + out = [] + for _ in range(n): + if null_prob and rng.chance(null_prob): + out.append(None) + elif rng.chance(0.05): + out.append(rng.choice(_FLOAT_SPECIALS)) + else: + out.append(rng.uniform(-1e6, 1e6)) + return pl.Series(out, dtype=dtype) + + +def _gen_f32(rng, n): + return _float_series(rng, n, pl.Float32) + + +def _gen_f64(rng, n): + return _float_series(rng, n, pl.Float64) + + +def _gen_bool(rng, n): + null_prob = 0.2 if rng.next_bool() else 0.0 + out = [None if (null_prob and rng.chance(null_prob)) else rng.next_bool() + for _ in range(n)] + return pl.Series(out, dtype=pl.Boolean) + + +def _gen_utf8(rng, n): + null_prob = 0.2 if rng.next_bool() else 0.0 + return pl.Series(_random_strings(rng, n, 16, null_prob), dtype=pl.Utf8) + + +def _cat_pool(rng): + cardinality = max(2, rng.next_int(16) + 2) + pool = list(dict.fromkeys(_random_strings(rng, cardinality * 2, 8, 0.0))) + while len(pool) < 2: + pool.append(f'_pad_{len(pool)}') + return pool + + +def _gen_categorical(rng, n): + pool = _cat_pool(rng) + null_prob = 0.2 if rng.next_bool() else 0.0 + out = [None if (null_prob and rng.chance(null_prob)) + else pool[rng.next_int(len(pool))] + for _ in range(n)] + return pl.Series(out, dtype=pl.Categorical) + + +def _gen_enum(rng, n): + pool = _cat_pool(rng) + null_prob = 0.2 if rng.next_bool() else 0.0 + out = [None if (null_prob and rng.chance(null_prob)) + else pool[rng.next_int(len(pool))] + for _ in range(n)] + return pl.Series(out, dtype=pl.Enum(pool)) + + +def _gen_date(rng, n): + # Datetime/Date map to a non-nullable QuestDB TIMESTAMP, so no nulls. 
+ base = datetime.date(2020, 1, 1) + out = [base + datetime.timedelta(days=rng.next_int(4000)) + for _ in range(n)] + return pl.Series(out, dtype=pl.Date) + + +def _gen_dt_us(rng, n): + base = datetime.datetime(2024, 1, 1) + out = [base + datetime.timedelta(seconds=rng.next_int(1 << 20)) + for _ in range(n)] + return pl.Series(out, dtype=pl.Datetime('us')) + + +def _gen_dt_tz(rng, n): + base = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc) + out = [base + datetime.timedelta(seconds=rng.next_int(1 << 20)) + for _ in range(n)] + return pl.Series(out, dtype=pl.Datetime('us', time_zone='UTC')) + + +def _gen_time(rng, n): + out = [datetime.time(rng.next_int(24), rng.next_int(60), rng.next_int(60)) + for _ in range(n)] + return pl.Series(out, dtype=pl.Time) + + +def _gen_decimal(rng, n): + out = [decimal.Decimal(rng.next_int(2_000_000) - 1_000_000).scaleb(-2) + for _ in range(n)] + return pl.Series(out, dtype=pl.Decimal(18, 2)) + + +def _gen_binary(rng, n): + out = [bytes(rng.next_int(256) for _ in range(rng.next_int(8))) + for _ in range(n)] + return pl.Series(out, dtype=pl.Binary) + + +def _gen_list_f64(rng, n): + out = [[rng.uniform(-1e3, 1e3) for _ in range(rng.next_int(4))] + for _ in range(n)] + return pl.Series(out, dtype=pl.List(pl.Float64)) + + +def _gen_array_f64(rng, n): + width = rng.next_int(3) + 1 + out = [[rng.uniform(-1e3, 1e3) for _ in range(width)] for _ in range(n)] + return pl.Series(out, dtype=pl.Array(pl.Float64, width)) + + +# (kind, gen, weight, string_like) +_FIELD_GENS = [ + ('i8', _gen_i8, 6, False), + ('i16', _gen_i16, 6, False), + ('i32', _gen_i32, 6, False), + ('i64', _gen_i64, 8, False), + ('u8', _gen_u8, 5, False), + ('u16', _gen_u16, 5, False), + ('u32', _gen_u32, 5, False), + ('u64', _gen_u64, 5, False), + ('f32', _gen_f32, 6, False), + ('f64', _gen_f64, 8, False), + ('bool', _gen_bool, 6, False), + ('utf8', _gen_utf8, 14, True), + ('categorical', _gen_categorical, 16, True), + ('enum', _gen_enum, 10, True), + ('date', 
_gen_date, 6, False), + ('dt_us', _gen_dt_us, 6, False), + ('dt_tz', _gen_dt_tz, 5, False), + ('time', _gen_time, 4, False), + ('decimal', _gen_decimal, 5, False), + ('binary', _gen_binary, 5, False), + ('list_f64', _gen_list_f64, 6, False), + ('array_f64', _gen_array_f64, 6, False), +] + + +# --------------------------------------------------------------------------- +# Designated-timestamp generators. Return (series, at_ok). +# --------------------------------------------------------------------------- + + +def _ts_valid(rng, n): + base = datetime.datetime(2024, 1, 1) + vals = [base + datetime.timedelta(seconds=i) for i in range(n)] + unit = 'ns' if rng.next_bool() else 'us' + return pl.Series(vals, dtype=pl.Datetime(unit)), True + + +def _ts_null(rng, n): + if n == 0: + return _ts_valid(rng, n) + base = datetime.datetime(2024, 1, 1) + vals = [base + datetime.timedelta(seconds=i) for i in range(n)] + for i in rng.sample(range(n), max(1, n // 8)): + vals[i] = None + return pl.Series(vals, dtype=pl.Datetime('us')), False + + +def _ts_pre_epoch(rng, n): + if n == 0: + return _ts_valid(rng, n) + base = datetime.datetime(1900, 1, 1) + vals = [base + datetime.timedelta(seconds=i) for i in range(n)] + return pl.Series(vals, dtype=pl.Datetime('us')), False + + +def _ts_wrong_type(rng, n): + return pl.Series(list(range(n)), dtype=pl.Int64), False + + +_AT_GENS = [ + (_ts_valid, 76), + (_ts_null, 8), + (_ts_pre_epoch, 8), + (_ts_wrong_type, 8), +] + + +def _build_frame(rng): + """Return (frame, kwargs, expected_supported, n_rows).""" + n_rows = rng.choice(ROW_COUNT_CHOICES) + + at_gen = _weighted_pick_value(rng, _AT_GENS) + ts, at_ok = at_gen(rng, n_rows) + + cols = {'ts': ts} + string_like = [] + gen_pool = [(k, g, w) for k, g, w, _ in _FIELD_GENS] + n_field_cols = rng.next_int(5) + for c in range(n_field_cols): + kind, gen = _weighted_pick_kv(rng, gen_pool) + name = f'c{c}_{kind}' + cols[name] = gen(rng, n_rows) + if next(s for k, _, _, s in _FIELD_GENS if k == kind): + 
string_like.append(name) + + # Empty -> no-op accept. Otherwise reject when the ts is invalid / wrong + # type, or the frame is ts-only (the capsule needs a non-ts column). + expected_supported = (n_rows == 0) or (at_ok and n_field_cols > 0) + + df = pl.DataFrame(cols) + if rng.next_bool(): + order = list(df.columns) + rng.shuffle(order) + df = df.select(order) + + sym_mode = _weighted_pick_value( + rng, [('auto', 6), (False, 3), ('list', 3)]) + if sym_mode == 'list' and string_like: + k = rng.next_int(len(string_like)) + 1 + symbols = rng.sample(string_like, k) + elif sym_mode == 'list': + symbols = 'auto' + else: + symbols = sym_mode + + kwargs = {'table_name': 'polars_fuzz', 'at': 'ts', 'symbols': symbols} + mrpb = _weighted_pick_value(rng, [(None, 5), (2, 2), (8, 2), (64, 2)]) + if mrpb is not None: + kwargs['max_rows_per_batch'] = mrpb + + frame = df.lazy() if rng.chance(0.25) else df + return frame, kwargs, expected_supported, n_rows + + +@unittest.skipUnless(pl is not None, 'polars not installed') +class TestClientPolarsDataframeFuzz(unittest.TestCase): + """Accept/reject fuzz: each iteration ingests a random polars frame through + ``Client.dataframe()`` (capsule path) to a local ``QwpAckServer``.""" + + DEFAULT_ITERS = 100 + + @classmethod + def setUpClass(cls): + cls.iter_seed_override = _parse_int_env(ITER_SEED_ENV) + if cls.iter_seed_override is not None: + cls.master_seed = None + cls.iters = 1 + sys.stderr.write( + f'>>>> polars dataframe fuzz: ' + f'iter_seed_override={_format_seed(cls.iter_seed_override)}, ' + f'iters=1\n') + return + cls.master_seed = _derive_master_seed() + cls.iters = _parse_int_env(ITERS_ENV) or cls.DEFAULT_ITERS + sys.stderr.write( + f'>>>> polars dataframe fuzz: master_seed=' + f'{_format_seed(cls.master_seed)}, iters={cls.iters}\n') + + def setUp(self): + self.server = QwpAckServer() + self.server.start() + self.conf = ( + f'qwpws::addr=127.0.0.1:{self.server.port};' + 'pool_size=1;pool_max=1;pool_reap=manual;') + + def
tearDown(self): + self.server.stop() + + def _seed_msg(self, iter_seed): + if self.master_seed is None: + return f'iter={_format_seed(iter_seed)}' + return (f'master={_format_seed(self.master_seed)}, ' + f'iter={_format_seed(iter_seed)}') + + def _master_label(self): + if self.master_seed is None: + return f'iter_seed_override={_format_seed(self.iter_seed_override)}' + return f'master_seed={_format_seed(self.master_seed)}' + + def _check_one(self, client, df, kwargs, expected_supported, n_rows, + iter_seed, prev_binary_frames): + try: + client.dataframe(df, **kwargs) + except (qi.UnsupportedDataFrameShapeError, qi.IngressError) as exc: + self.assertFalse( + expected_supported, + f'client rejected an expected-supported frame; ' + f'{self._seed_msg(iter_seed)}: {exc}') + # Under a small max_rows_per_batch a mid-stream reject can have + # already flushed earlier batches, so we don't assert the count. + return self.server.snapshot()['binary_frames'] + self.assertTrue( + expected_supported, + f'client accepted an expected-rejected frame; ' + f'{self._seed_msg(iter_seed)}') + cur = self.server.snapshot()['binary_frames'] + if n_rows == 0: + self.assertEqual( + cur, prev_binary_frames, + f'empty df published a binary frame; ' + f'{self._seed_msg(iter_seed)}') + else: + self.assertGreater( + cur, prev_binary_frames, + f'accepted non-empty df published no binary frame; ' + f'{self._seed_msg(iter_seed)}') + return cur + + def _iter_seeds(self): + if self.iter_seed_override is not None: + return [self.iter_seed_override] + master = Rng(self.master_seed) + return [master.next_long() for _ in range(self.iters)] + + def test_fuzz_round_trip(self): + seeds = self._iter_seeds() + client = qi.Client.from_conf(self.conf) + failures = [] + try: + prev = 0 + for iter_seed in seeds: + rng = Rng(iter_seed) + try: + df, kwargs, expected_supported, n_rows = _build_frame(rng) + prev = self._check_one( + client, df, kwargs, expected_supported, n_rows, + iter_seed, prev) + except 
AssertionError as exc: + failures.append((iter_seed, type(exc).__name__, str(exc))) + prev = self.server.snapshot()['binary_frames'] + except Exception as exc: # noqa: BLE001 — fuzz triage + failures.append( + (iter_seed, type(exc).__name__, repr(exc))) + prev = self.server.snapshot()['binary_frames'] + finally: + client.close() + + stats = self.server.snapshot() + self.assertEqual( + stats['errors'], [], + f'server saw protocol errors: {stats["errors"]}; ' + f'{self._master_label()}') + + if failures: + preview = '\n'.join( + f' iter={_format_seed(s)} [{cls}]: {m}' + for s, cls, m in failures[:5]) + self.fail( + f'{len(failures)}/{len(seeds)} iterations failed.\n' + f'{self._master_label()}\n{preview}') + + +@unittest.skipUnless(pl is not None, 'polars not installed') +class TestClientPolarsDataframeSfaFuzz(TestClientPolarsDataframeFuzz): + """Same polars fuzz generator, through the columnar SFA backend.""" + + DEFAULT_ITERS = 25 + + def setUp(self): + self.server = QwpAckServer() + self.server.start() + self._sf_tmp = tempfile.TemporaryDirectory( + prefix='client-polars-sfa-fuzz-') + self.sender_id = 'py-pl-fuzz-' + uuid.uuid4().hex[:8] + self.conf = _sfa_conf( + self.server.port, + self.sender_id, + self._sf_tmp.name) + + def tearDown(self): + try: + self.assertEqual( + _sfa_file_count(self._sf_tmp.name, self.sender_id), + 0, + f'SFA files left after polars dataframe fuzz; ' + f'{self._master_label()}') + finally: + self.server.stop() + self._sf_tmp.cleanup() + + +def _norm_col(series): + out = [] + for v in series.to_list(): + if isinstance(v, (int, float)) and not isinstance(v, bool): + out.append(float(v)) + else: + out.append(v) + return out + + +def _val_match(a, b): + if a is None or b is None: + return a is None and b is None + if isinstance(a, float) and isinstance(b, float): + return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) + return a == b + + +class _RunningQuestDb: + """Adapter over an already-running QuestDB (its HTTP host:port) so the + 
round-trip can target a live instance for debugging.""" + + def __init__(self, host, port): + self.host = host + self.http_server_port = port + + def http_sql_query(self, sql_query): + import json + import urllib.error + import urllib.parse + import urllib.request + url = (f'http://{self.host}:{self.http_server_port}/exec?' + + urllib.parse.urlencode({'query': sql_query})) + try: + buf = urllib.request.urlopen(url, timeout=10).read() + except urllib.error.HTTPError as exc: + buf = exc.read() + data = json.loads(buf) + if 'error' in data: + raise RuntimeError(data['error']) + return data + + def stop(self): + pass + + +@unittest.skipUnless(pl is not None, 'polars not installed') +class TestClientPolarsDataframeRoundTrip(unittest.TestCase): + """Ingest a random polars frame via ``Client.dataframe()`` → real + QuestDB → read back via ``Client.query`` → assert value equivalence. + + Point at a running QuestDB with ``QDB_HTTP_ADDR=host:port`` (handy for + debugging), or set ``QDB_REPO_PATH=/path/to/questdb`` to spawn a + class-scoped fixture. Tables are dropped between iterations. 
Write and + read-back both stay in polars (pyarrow-free); the comparison is + value-by-value to tolerate QuestDB's widening (SYMBOL, LONG, ...).""" + + DEFAULT_ITERS = 8 + + @classmethod + def setUpClass(cls): + addr = os.environ.get('QDB_HTTP_ADDR') + if addr: + host, _, port = addr.partition(':') + cls.qdb = _RunningQuestDb(host or 'localhost', int(port or '9000')) + cls._owns_qdb = False + else: + repo = os.environ.get('QDB_REPO_PATH') + if not repo: + raise unittest.SkipTest( + 'set QDB_HTTP_ADDR=host:port for a running QuestDB, ' + 'or QDB_REPO_PATH=/path/to/questdb to spawn one') + import importlib + import pathlib + import shutil + cls._fixture_mod = importlib.import_module('fixture') + install_path = cls._fixture_mod.install_questdb_from_repo( + pathlib.Path(repo)) + plain_dir = PROJ_ROOT / 'build' / 'questdb' / 'layer3_polars' + plain_dir.mkdir(parents=True, exist_ok=True) + shutil.copytree(install_path, plain_dir, dirs_exist_ok=True) + cls.qdb = cls._fixture_mod.QuestDbFixture( + plain_dir, auth=False, http=True) + cls.qdb.start() + cls._owns_qdb = True + + cls.iter_seed_override = _parse_int_env(ITER_SEED_ENV) + if cls.iter_seed_override is not None: + cls.master_seed = None + cls.iters = 1 + else: + cls.master_seed = _derive_master_seed() + cls.iters = _parse_int_env(ITERS_ENV) or cls.DEFAULT_ITERS + sys.stderr.write( + f'>>>> polars round-trip fuzz vs real QuestDB: ' + f'master=' + f'{_format_seed(cls.master_seed) if cls.master_seed else "n/a"}, ' + f'iter_override=' + f'{_format_seed(cls.iter_seed_override) if cls.iter_seed_override else "n/a"}, ' + f'iters={cls.iters}\n') + + @classmethod + def tearDownClass(cls): + if getattr(cls, '_owns_qdb', False) and getattr(cls, 'qdb', None): + cls.qdb.stop() + + @property + def conf(self): + return f'qwpws::addr={self.qdb.host}:{self.qdb.http_server_port};' + + def _wait_for_rows(self, table_name, expected, timeout_s=30): + deadline = time.monotonic() + timeout_s + while time.monotonic() < deadline: + try: + 
res = self.qdb.http_sql_query( + f'SELECT count() FROM {table_name}') + except Exception: + time.sleep(0.1) + continue + rows = res.get('dataset') or [] + if rows and rows[0][0] >= expected: + return + time.sleep(0.1) + raise RuntimeError( + f'WAL apply timed out: {expected} rows expected on {table_name}') + + def _drop_table(self, table_name): + try: + self.qdb.http_sql_query(f'DROP TABLE IF EXISTS {table_name}') + except Exception: + pass + + # Lossless values only: INT64_MIN / NaN alias QuestDB LONG / DOUBLE null. + def _build_simple_frame(self, rng): + n_rows = max(rng.choice(ROW_COUNT_CHOICES), 1) + base = datetime.datetime(2024, 1, 1) + cols = { + 'ts': pl.Series( + [base + datetime.timedelta(seconds=i) for i in range(n_rows)], + dtype=pl.Datetime('us')), + 'id': pl.Series(list(range(1, n_rows + 1)), dtype=pl.Int64), + } + shape = rng.choice(['numeric', 'string', 'categorical', 'mixed']) + if shape in ('numeric', 'mixed'): + cols['price'] = pl.Series( + [rng.uniform(-1e6, 1e6) for _ in range(n_rows)], + dtype=pl.Float64) + cols['count'] = pl.Series( + [int(rng.uniform(-(1 << 50), 1 << 50)) for _ in range(n_rows)], + dtype=pl.Int64) + if shape in ('string', 'mixed'): + cols['note'] = pl.Series( + _random_strings(rng, n_rows, 8, 0.0, ascii_only=True, + empty_prob=0.0), + dtype=pl.Utf8) + if shape in ('categorical', 'mixed'): + pool = list(dict.fromkeys( + _random_strings(rng, 16, 6, 0.0, ascii_only=True, + empty_prob=0.0))) + while len(pool) < 2: + pool.append(f'p{len(pool)}') + null_prob = 0.2 if rng.next_bool() else 0.0 + vals = [None if (null_prob and rng.chance(null_prob)) + else pool[rng.next_int(len(pool))] + for _ in range(n_rows)] + cols['sym'] = pl.Series(vals, dtype=pl.Categorical) + return pl.DataFrame(cols), shape, n_rows + + def _iter_seeds(self): + if self.iter_seed_override is not None: + return [self.iter_seed_override] + master = Rng(self.master_seed) + return [master.next_long() for _ in range(self.iters)] + + def test_round_trip(self): + 
seeds = self._iter_seeds() + failures = [] + for iter_idx, iter_seed in enumerate(seeds): + rng = Rng(iter_seed) + shape = '?' + table_name = f'plrt_{iter_idx}_{iter_seed:016x}' + try: + df, shape, n_rows = self._build_simple_frame(rng) + self._drop_table(table_name) + with qi.Client.from_conf(self.conf) as client: + client.dataframe(df, table_name=table_name, at='ts') + self._wait_for_rows(table_name, n_rows) + + cols = [c for c in df.columns if c != 'ts'] + df_in = df.select(cols).sort('id') + sql = (f"SELECT {','.join(cols)} FROM {table_name} " + f"ORDER BY id") + with qi.Client.from_conf(self.conf) as client: + df_out = client.query(sql).to_polars().sort('id') + + mismatch = None + for c in sorted(cols): + a = _norm_col(df_in.get_column(c)) + b = _norm_col(df_out.get_column(c)) + if len(a) != len(b): + mismatch = f'{c}: {len(a)} vs {len(b)} rows' + break + for i, (x, y) in enumerate(zip(a, b)): + if not _val_match(x, y): + mismatch = f'{c}[{i}]: {x!r} != {y!r}' + break + if mismatch: + break + if mismatch: + raise AssertionError(mismatch) + except Exception as exc: # noqa: BLE001 — fuzz triage + failures.append( + (iter_seed, shape, type(exc).__name__, repr(exc))) + self._drop_table(table_name) + + if failures: + preview = '\n'.join( + f' iter={_format_seed(s)} shape={sh} [{cls}]: {m}' + for s, sh, cls, m in failures[:5]) + self.fail( + f'{len(failures)}/{len(seeds)} iterations failed.\n' + f'(showing first 5)\n{preview}') + + +if __name__ == '__main__': + unittest.main() diff --git a/test/test_dataframe.py b/test/test_dataframe.py index 0bde05cf..2fbc469b 100644 --- a/test/test_dataframe.py +++ b/test/test_dataframe.py @@ -1,7 +1,8 @@ #!/usr/bin/env python3 -import sys import os +import sys + sys.dont_write_bytecode = True import unittest import datetime as dt @@ -21,8 +22,6 @@ import pytz _TZ = pytz.timezone('America/New_York') -import patch_path - import questdb.ingress as qi import pandas as pd import numpy as np @@ -394,6 +393,610 @@ def 
test_row_of_nulls(self): qi.IngressError, 'Bad dataframe row.*1: All values are nulls.'): _dataframe(self.version, df, table_name='tbl1', symbols=['a'], at=qi.ServerTimestamp) + def test_planning_error_keeps_existing_buffer(self): + buf = qi.Buffer(protocol_version=self.version) + buf.dataframe( + pd.DataFrame({'a': [1]}), + table_name='tbl1', + at=qi.ServerTimestamp) + before = bytes(buf) + + with self.assertRaisesRegex( + qi.IngressError, + "`symbols`: Bad dtype `int64`.*'a'.*Must be a strings column."): + buf.dataframe( + pd.DataFrame({'a': [1]}), + table_name='tbl2', + symbols=['a'], + at=qi.ServerTimestamp) + + self.assertEqual(bytes(buf), before) + + def test_debug_dataframe_plan_fixed_table_and_timestamp_column(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2], dtype='int64'), + 'price': pd.Series([10.5, 11.5], dtype='float64'), + }) + + plan = qi._debug_dataframe_plan( + df, table_name='trades', at='ts', symbols=False) + cols = {col['orig_name']: col for col in plan['cols']} + + self.assertEqual(plan['row_count'], 2) + self.assertEqual(plan['col_count'], 3) + self.assertEqual(plan['fixed_table_name'], 'trades') + self.assertEqual(plan['at_value'], 'column') + self.assertEqual(cols['seq']['target'], 'integer') + self.assertEqual(cols['seq']['target_name'], 'seq') + self.assertEqual(cols['price']['target'], 'float') + self.assertEqual(cols['price']['target_name'], 'price') + self.assertEqual(cols['ts']['target'], 'designated timestamp') + self.assertIsNone(cols['ts']['target_name']) + self.assertEqual( + _dataframe(1, df, table_name='trades', at='ts', symbols=False), + b'trades seq=1i,price=10.5 1704067200000000000\n' + b'trades seq=2i,price=11.5 1704067201000000000\n') + + def test_debug_dataframe_plan_handles_zero_row_dataframe(self): + df = pd.DataFrame({ + 'ts': pd.Series([], dtype='datetime64[ns]'), + 'seq': pd.Series([], 
dtype='int64'), + }) + + row_plan = qi._debug_dataframe_plan( + df, table_name='trades', at='ts') + columnar_plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts') + + self.assertEqual(row_plan['row_count'], 0) + self.assertEqual(row_plan['col_count'], 0) + self.assertEqual(row_plan['cols'], []) + self.assertTrue(columnar_plan['supported']) + self.assertEqual(columnar_plan['failures'], []) + self.assertEqual(columnar_plan['normalizations'], []) + + def test_debug_dataframe_plan_table_column_and_auto_symbol(self): + df = pd.DataFrame({ + 'tbl': ['t1', 't2'], + 'sym': pd.Categorical(['a', 'b']), + 'value': pd.Series([1, 2], dtype='int64'), + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], dtype='datetime64[ns]'), + }) + + plan = qi._debug_dataframe_plan(df, table_name_col='tbl', at='ts') + cols = {col['orig_name']: col for col in plan['cols']} + + self.assertIsNone(plan['fixed_table_name']) + self.assertEqual(plan['at_value'], 'column') + self.assertEqual(cols['tbl']['target'], 'table name') + self.assertIsNone(cols['tbl']['target_name']) + self.assertEqual(cols['sym']['target'], 'symbol') + self.assertEqual(cols['sym']['target_name'], 'sym') + self.assertEqual(cols['value']['target'], 'integer') + self.assertEqual(cols['value']['target_name'], 'value') + self.assertEqual(cols['ts']['target'], 'designated timestamp') + self.assertEqual( + _dataframe(1, df, table_name_col='tbl', at='ts'), + b't1,sym=a value=1i 1704067200000000000\n' + b't2,sym=b value=2i 1704067201000000000\n') + + def test_debug_dataframe_plan_reuses_row_path_validation(self): + df = pd.DataFrame({'a': [1]}) + with self.assertRaisesRegex( + qi.IngressError, + "`symbols`: Bad dtype `int64`.*'a'.*Must be a strings column."): + qi._debug_dataframe_plan( + df, + table_name='tbl1', + symbols=['a'], + at=qi.ServerTimestamp) + + def test_debug_dataframe_columnar_plan_accepts_v1_numeric_core(self): + df = pd.DataFrame({ + 'ts': 
pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2], dtype='int64'), + 'price': pd.Series([10.5, 11.5], dtype='float64'), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts', symbols=False) + + self.assertTrue(plan['supported']) + self.assertEqual(plan['failures'], []) + + def test_columnar_plan_populates_plain_arrow_uint32_as_integer(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], + dtype='datetime64[ns]'), + 'seq': pd.Series( + pa.array([1, 4294967295], type=pa.uint32()), + dtype=pd.ArrowDtype(pa.uint32())), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts', symbols=False) + self.assertTrue(plan['supported'], plan['failures']) + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, table_name='trades', at='ts', symbols=False) + + self.assertEqual(result['populated_rows_total'], 2) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_columnar_plan_accepts_arrow_wide_numeric_sources(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01'), + pd.Timestamp('2024-01-01 00:00:02')], + dtype='datetime64[ns]'), + 'arrow_i64': pd.Series( + pa.array([1, None, -3], type=pa.int64()), + dtype=pd.ArrowDtype(pa.int64())), + 'nullable_i64': pd.Series( + [4, pd.NA, -6], dtype=pd.Int64Dtype()), + 'arrow_f64': pd.Series( + pa.array([1.5, None, -3.25], type=pa.float64()), + dtype=pd.ArrowDtype(pa.float64())), + 'nullable_f64': pd.Series( + [4.5, pd.NA, -6.25], dtype=pd.Float64Dtype()), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts', symbols=False) + self.assertTrue(plan['supported'], plan['failures']) + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, table_name='trades', at='ts', 
symbols=False) + + self.assertEqual(result['populated_rows_total'], 3) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_debug_dataframe_columnar_plan_accepts_narrow_numpy_dtypes(self): + # Step 3 broadened the columnar planner to accept every + # narrower NumPy numeric dtype + native bool. The shapes + # that were rejected pre-Step-3 (int8/16/32, uint*, float32, + # bool) all flow through the Rust widening / packing + # appender now. + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00')], dtype='datetime64[ns]'), + 'narrow_int': pd.Series([1], dtype='int32'), + 'narrow_float': pd.Series([1.5], dtype='float32'), + 'native_bool': pd.Series([True], dtype='bool'), + 'u8': pd.Series([200], dtype='uint8'), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts', symbols=False) + + self.assertTrue(plan['supported']) + self.assertEqual(plan['failures'], []) + + def test_debug_dataframe_columnar_plan_accepts_tz_aware_timestamps(self): + # The columnar v1 planner was originally restricted to bare + # numpy datetime64[ns/us] for both the designated `at` column + # and `ts` field columns. The row path (Buffer.dataframe / + # Sender.dataframe) accepted tz-aware DatetimeTZDtype and + # pyarrow timestamp(unit, tz=...) all along; columnar v1 + # was tightened by accident. This test pins the symmetric + # contract: every datetime variant the row path accepts + # also passes the columnar planner. + cases = [ + # 1. pd.to_datetime(['...Z']) infers DatetimeTZDtype. + pd.to_datetime( + ['2024-01-01T00:00:00Z', '2024-01-01T00:00:01Z']), + # 2. Explicit DatetimeTZDtype with a non-UTC zone. + pd.Series( + [pd.Timestamp('2024-01-01 00:00:00', + tz='America/New_York'), + pd.Timestamp('2024-01-01 00:00:01', + tz='America/New_York')]), + # 3. ArrowDtype timestamp[us, tz=...]. 
+ pd.Series( + [1700000000000000, 1700000001000000], + dtype=pd.ArrowDtype( + pa.timestamp('us', tz='UTC'))), + ] + for idx, ts_series in enumerate(cases): + with self.subTest(case=idx, dtype=str(ts_series.dtype)): + df = pd.DataFrame({ + 'ts': ts_series, + 'lg': pd.Series([1, 2], dtype='int64'), + }) + plan = qi._debug_dataframe_columnar_plan( + df, table_name='t', at='ts') + self.assertTrue( + plan['supported'], + f'case={idx} dtype={ts_series.dtype!r} ' + f'failures={plan["failures"]!r}') + + # 4. tz-aware as a field column (non-`at`), with tz-naive at=. + df = pd.DataFrame({ + 'ts': pd.Series( + [pd.Timestamp('2024-01-01'), + pd.Timestamp('2024-01-02')], dtype='datetime64[ns]'), + 'event_ts': pd.to_datetime( + ['2024-01-01T00:00:00Z', '2024-01-01T00:00:01Z']), + 'lg': pd.Series([1, 2], dtype='int64'), + }) + plan = qi._debug_dataframe_columnar_plan( + df, table_name='t', at='ts') + self.assertTrue( + plan['supported'], + f'tz-aware field column failures={plan["failures"]!r}') + + def test_debug_dataframe_columnar_plan_rejects_unsupported_shape(self): + df = pd.DataFrame({ + 'tbl': ['t1'], + 'sym': pd.Series(['a'], dtype='object'), + 'value': pd.Series([1], dtype='int64'), + 'ts': pd.Series([pd.NaT], dtype='datetime64[ns]'), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name_col='tbl', symbols=['sym'], at='ts') + reasons = [failure['reason'] for failure in plan['failures']] + + self.assertFalse(plan['supported']) + self.assertTrue(any('fixed table_name' in reason + for reason in reasons)) + self.assertTrue(any('Categorical or string[pyarrow]' in reason + for reason in reasons)) + self.assertTrue(any('cannot contain NaT' in reason + for reason in reasons)) + + def test_debug_dataframe_columnar_plan_accepts_v1_mixed_fast_paths(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01'), + pd.Timestamp('2024-01-01 00:00:02')], + dtype='datetime64[ns]'), + 'event_ts': pd.Series([ + 
pd.Timestamp('2024-01-02 00:00:00'), + pd.Timestamp('2024-01-02 00:00:01'), + pd.Timestamp('2024-01-02 00:00:02')], + dtype='datetime64[ns]'), + 'sym': pd.Categorical(['a', None, 'b']), + 'label': pd.Series( + pa.array(['alpha', None, 'gamma'], type=pa.string()), + dtype='string[pyarrow]'), + 'seq': pd.Series([1, 2, 3], dtype='int64'), + 'price': pd.Series([10.5, 11.5, 12.5], dtype='float64'), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts') + + self.assertTrue(plan['supported']) + self.assertEqual(plan['failures'], []) + + def test_debug_dataframe_columnar_plan_rejects_timestamp_only_frame(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], + dtype='datetime64[ns]'), + }) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts') + + self.assertFalse(plan['supported']) + self.assertEqual( + [failure['reason'] for failure in plan['failures']], + ['v1 requires at least one non-timestamp data column.']) + with self.assertRaises(qi.UnsupportedDataFrameShapeError) as cm: + qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts') + self.assertEqual( + cm.exception.column_failures, + ({'column': None, + 'target': None, + 'source_code': None, + 'reason': 'v1 requires at least one non-timestamp data column.'},)) + + def test_debug_dataframe_columnar_plan_preserves_large_string(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], + dtype='datetime64[ns]'), + 'label': pd.Series( + pa.array(['alpha', 'beta'], type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())), + 'seq': pd.Series([1, 2], dtype='int64'), + }) + + row_plan = qi._debug_dataframe_plan( + df, table_name='trades', at='ts') + label_col = next( + col for col in row_plan['cols'] + if col['orig_name'] == 'label') + self.assertEqual(label_col['source_code'], 
406000) + self.assertFalse(label_col['large_string_cast_to_utf8']) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts') + + self.assertTrue(plan['supported']) + self.assertEqual(plan['failures'], []) + self.assertEqual(plan['normalizations'], []) + + def test_debug_dataframe_columnar_plan_preserves_large_string_category(self): + symbols = pd.Series( + pa.array( + ['alpha', 'beta', None, 'alpha'], + type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())).astype('category') + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01'), + pd.Timestamp('2024-01-01 00:00:02'), + pd.Timestamp('2024-01-01 00:00:03')], + dtype='datetime64[ns]'), + 'sym': symbols, + 'seq': pd.Series([1, 2, 3, 4], dtype='int64'), + }) + + row_plan = qi._debug_dataframe_plan( + df, table_name='trades', at='ts') + sym_col = next( + col for col in row_plan['cols'] + if col['orig_name'] == 'sym') + self.assertFalse(sym_col['large_string_cast_to_utf8']) + + plan = qi._debug_dataframe_columnar_plan( + df, table_name='trades', at='ts') + + self.assertTrue(plan['supported']) + self.assertEqual(plan['failures'], []) + self.assertEqual(plan['normalizations'], []) + + def test_bench_dataframe_plan_and_populate_column_chunks(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01')], dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2], dtype='int64'), + 'price': pd.Series([10.5, 11.5], dtype='float64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + symbols=False, + iterations=3) + + self.assertEqual(result['iterations'], 3) + self.assertEqual(result['row_count'], 2) + self.assertEqual(result['col_count'], 3) + self.assertEqual(result['logical_cells'], 6) + self.assertEqual(result['populated_chunks'], 3) + self.assertEqual(result['last_populated_rows'], 2) + 
self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_splits_chunks(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01'), + pd.Timestamp('2024-01-01 00:00:02')], dtype='datetime64[ns]'), + 'seq': pd.Series([1, 2, 3], dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + symbols=False, + iterations=2, + max_rows_per_chunk=2) + + self.assertEqual(result['rows_per_chunk'], 2) + self.assertEqual(result['populated_chunks'], 4) + self.assertEqual(result['populated_rows_total'], 6) + self.assertEqual(result['last_populated_rows'], 1) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_aligns_nullable_chunks(self): + df = pd.DataFrame({ + 'ts': pd.Series( + pd.date_range('2024-01-01', periods=10, freq='s'), + dtype='datetime64[ns]'), + 'sym': pd.Categorical( + ['a', None, 'b', 'c', None, 'a', 'b', 'c', 'a', None]), + 'seq': pd.Series(range(10), dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=1, + max_rows_per_chunk=3) + + self.assertEqual(result['rows_per_chunk'], 8) + self.assertEqual(result['populated_chunks'], 2) + self.assertEqual(result['populated_rows_total'], 10) + self.assertEqual(result['last_populated_rows'], 2) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_reuses_arrow_import_across_three_chunks(self): + labels = [ + 'alpha', None, 'beta', 'gamma', + None, 'delta', 'epsilon', 'zeta', + 'eta', None, 'theta', 'iota', + 'kappa', 'lambda', None, 'mu', + 'nu', 'xi', None, 'omicron', + ] + df = pd.DataFrame({ + 'ts': pd.Series( + pd.date_range('2024-01-01', periods=20, freq='s'), + dtype='datetime64[ns]'), + 'sym': pd.Categorical(labels), + 'label': pd.Series( + pa.array(labels, 
type=pa.string()), + dtype='string[pyarrow]'), + 'seq': pd.Series(range(20), dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=1, + max_rows_per_chunk=3) + + self.assertEqual(result['rows_per_chunk'], 8) + self.assertEqual(result['populated_chunks'], 3) + self.assertEqual(result['populated_rows_total'], 20) + self.assertEqual(result['last_populated_rows'], 4) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_aligns_pyobj_chunks(self): + # Regression: PyObject-sourced columns can carry nulls (or + # always be bitmaps in the bool_pyobj case). The chunk-size + # planner must align to 8 even though the wrapping ArrowArray + # has null_count=0 (the pyobj wrapper hardcodes this). + # + # Without the fix, max_rows_per_chunk=3 would survive as 3 + # and the second chunk's row_offset=3 would trip the + # byte-aligned-offset check in the emit branch. + df = pd.DataFrame({ + 'ts': pd.Series( + pd.date_range('2024-01-01', periods=10, freq='s'), + dtype='datetime64[ns]'), + 'obj_str': pd.Series( + ['a', None, 'b', 'c', None, 'a', 'b', 'c', 'a', None], + dtype='object'), + 'seq': pd.Series(range(10), dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=1, + max_rows_per_chunk=3) + + self.assertEqual(result['rows_per_chunk'], 8) + self.assertEqual(result['populated_chunks'], 2) + self.assertEqual(result['populated_rows_total'], 10) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_aligns_bool_pyobj_chunks(self): + # bool_pyobj always builds a bitmap of values; the emit + # offsets by row_offset // 8 regardless of nulls, so the + # planner must require 8-row alignment for this source too. 
+ df = pd.DataFrame({ + 'ts': pd.Series( + pd.date_range('2024-01-01', periods=10, freq='s'), + dtype='datetime64[ns]'), + 'flag': pd.Series( + [True, False, True, True, False] * 2, + dtype='object'), + 'seq': pd.Series(range(10), dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=1, + max_rows_per_chunk=5) + + self.assertEqual(result['rows_per_chunk'], 8) + self.assertEqual(result['populated_chunks'], 2) + self.assertEqual(result['populated_rows_total'], 10) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_binary_pyobj(self): + # Regression: a pandas `bytes`/object column forces the manual + # columnar planner, which previously rejected the binary target + # even though the build/populate path fully supports it. + df = pd.DataFrame({ + 'ts': pd.Series( + pd.date_range('2024-01-01', periods=4, freq='s'), + dtype='datetime64[ns]'), + 'blob': pd.Series( + [b'hello', b'', b'\x00\x01\x02', None], + dtype='object'), + 'seq': pd.Series(range(4), dtype='int64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=1, + max_rows_per_chunk=16384) + + self.assertEqual(result['populated_rows_total'], 4) + self.assertEqual(result['row_path_cell_emissions'], 0) + + def test_bench_dataframe_plan_and_populate_rejects_unsupported_shape(self): + # Step 3 made bool/int32/etc. supported. Pick a shape that + # remains rejected: NaT in the designated timestamp. 
+ df = pd.DataFrame({ + 'ts': pd.Series([pd.NaT], dtype='datetime64[ns]'), + 'seq': pd.Series([1], dtype='int64'), + }) + + with self.assertRaisesRegex( + qi.UnsupportedDataFrameShapeError, + 'DataFrame is not supported'): + qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + symbols=False) + + def test_bench_dataframe_plan_and_populate_mixed_fast_paths(self): + df = pd.DataFrame({ + 'ts': pd.Series([ + pd.Timestamp('2024-01-01 00:00:00'), + pd.Timestamp('2024-01-01 00:00:01'), + pd.Timestamp('2024-01-01 00:00:02')], + dtype='datetime64[ns]'), + 'event_ts': pd.Series([ + pd.Timestamp('2024-01-02 00:00:00'), + pd.Timestamp('2024-01-02 00:00:01'), + pd.Timestamp('2024-01-02 00:00:02')], + dtype='datetime64[ns]'), + 'sym': pd.Categorical(['a', None, 'b']), + 'label': pd.Series( + pa.array(['alpha', None, 'gamma'], type=pa.string()), + dtype='string[pyarrow]'), + 'seq': pd.Series([1, 2, 3], dtype='int64'), + 'price': pd.Series([10.5, 11.5, 12.5], dtype='float64'), + }) + + result = qi._bench_dataframe_plan_and_populate_column_chunks( + df, + table_name='trades', + at='ts', + iterations=2) + + self.assertEqual(result['iterations'], 2) + self.assertEqual(result['row_count'], 3) + self.assertEqual(result['col_count'], 6) + self.assertEqual(result['populated_chunks'], 2) + self.assertEqual(result['last_populated_rows'], 3) + self.assertEqual(result['row_path_cell_emissions'], 0) + def test_u8_numpy_col(self): df = pd.DataFrame({'a': pd.Series([ 1, 2, 3, @@ -585,6 +1188,34 @@ def test_f64_numpy_col(self): b'tbl1 a' + _float_binary_bytes(float('NAN'), self.version == 1) + b'\n' + b'tbl1 a' + _float_binary_bytes(1.7976931348623157e308, self.version == 1) + b'\n') + def test_datetime_pyobj_column_matches_numpy(self): + ts = dt.datetime(2021, 1, 1, 12, 0, 0, 123456) + obj_df = pd.DataFrame({'ts': pd.Series([ts], dtype=object)}) + np_df = pd.DataFrame( + {'ts': pd.Series([ts]).astype('datetime64[us]')}) + obj_buf = _dataframe( + 
self.version, obj_df, table_name='tbl', at=qi.ServerTimestamp) + np_buf = _dataframe( + self.version, np_df, table_name='tbl', at=qi.ServerTimestamp) + self.assertNotEqual(obj_buf, b'') + self.assertEqual(obj_buf, np_buf) + + def test_datetime_pyobj_column_with_null_matches_numpy(self): + ts = dt.datetime(2021, 1, 1, 12, 0, 0) + obj_df = pd.DataFrame({ + 'sym': pd.Categorical(['a', 'b']), + 'ts': pd.Series([ts, None], dtype=object)}) + np_df = pd.DataFrame({ + 'sym': pd.Categorical(['a', 'b']), + 'ts': pd.Series([ts, pd.NaT]).astype('datetime64[us]')}) + self.assertEqual( + _dataframe( + self.version, obj_df, table_name='tbl', + at=qi.ServerTimestamp), + _dataframe( + self.version, np_df, table_name='tbl', + at=qi.ServerTimestamp)) + def test_decimal_pyobj_column(self): decimals = [ Decimal('123.45'), @@ -1722,6 +2353,30 @@ def test_cat_i8_symbol(self): self._test_cat_symbol(30) self._test_cat_symbol(127) + def test_cat_large_string_symbol(self): + df = pd.DataFrame({ + 'a': pd.Series( + pa.array( + ['alpha', 'beta', None, 'alpha'], + type=pa.large_string()), + dtype=pd.ArrowDtype(pa.large_string())).astype('category'), + 'b': [1, 2, 3, 4], + }) + + buf = _dataframe( + self.version, + df, + table_name='tbl1', + symbols=True, + at=qi.ServerTimestamp) + + self.assertEqual( + buf, + b'tbl1,a=alpha b=1i\n' + b'tbl1,a=beta b=2i\n' + b'tbl1 b=3i\n' + b'tbl1,a=alpha b=4i\n') + def test_cat_i16_symbol(self): self._test_cat_symbol(128) self._test_cat_symbol(4000) @@ -1897,7 +2552,7 @@ def df_eq(exp_df, deser_df, exp_dtypes): self.assertTrue(exp_df.equals(deser_df)) # fastparquet doesn't roundtrip with pyarrow parquet properly. - # It decays categories to object and UInt8 to float64. + # It decays categories to object/string and UInt8 to float64. # We need to set up special case expected results for that. 
fallback_exp_dtypes = [ np.dtype('O'), @@ -1906,13 +2561,23 @@ def df_eq(exp_df, deser_df, exp_dtypes): np.dtype('float64')] fallback_df = df.astype({'s': 'object', 'b': 'float64'}) + def fastparquet_pyarrow_expected(deser_df): + actual_dtypes = list(deser_df.dtypes) + if not isinstance(actual_dtypes[0], pd.StringDtype): + return fallback_df, fallback_exp_dtypes + + exp_dtypes = list(fallback_exp_dtypes) + exp_dtypes[0] = actual_dtypes[0] + return fallback_df.astype({'s': actual_dtypes[0]}), exp_dtypes + df_eq(df, pa2pa_df, exp_dtypes) if fp_wrote: pa2fp_df = pd.read_parquet(pa_parquet_path, engine='fastparquet') fp2pa_df = pd.read_parquet(fp_parquet_path, engine='pyarrow') fp2fp_df = pd.read_parquet(fp_parquet_path, engine='fastparquet') df_eq(df, pa2fp_df, exp_dtypes) - df_eq(fallback_df, fp2pa_df, fallback_exp_dtypes) + fp2pa_exp_df, fp2pa_exp_dtypes = fastparquet_pyarrow_expected(fp2pa_df) + df_eq(fp2pa_exp_df, fp2pa_df, fp2pa_exp_dtypes) df_eq(df, fp2fp_df, exp_dtypes) exp = ( diff --git a/test/test_dataframe_leaks.py b/test/test_dataframe_leaks.py index eab23314..03557bd6 100644 --- a/test/test_dataframe_leaks.py +++ b/test/test_dataframe_leaks.py @@ -1,43 +1,213 @@ +import sys +sys.dont_write_bytecode = True +import ctypes +import gc +import unittest + + +def _limit_malloc_arenas(): + # Pin glibc to one arena before the sender's threads spawn; per-thread + # arenas otherwise inflate RSS without a real leak. 
+ if not sys.platform.startswith('linux'): + return + try: + ctypes.CDLL('libc.so.6', use_errno=False).mallopt(-8, 1) # M_ARENA_MAX + except (OSError, AttributeError): + pass + + +_limit_malloc_arenas() + import patch_path -patch_path.patch() -import pandas as pd import questdb.ingress as qi -import os, psutil -process = psutil.Process(os.getpid()) +try: + import numpy as np + import pandas as pd +except ImportError: + np = None + pd = None + +try: + import pyarrow as pa +except ImportError: + pa = None + +try: + import psutil + _PROCESS = psutil.Process() +except ImportError: + psutil = None + + +def _malloc_trim(): + # Return glibc per-thread arena free space to the OS so RSS reflects live + # memory; a real leak survives the trim. + if not sys.platform.startswith('linux'): + return + try: + ctypes.CDLL('libc.so.6', use_errno=False).malloc_trim(0) + except (OSError, AttributeError): + pass + + +def _rss(): + _malloc_trim() + return _PROCESS.memory_info().rss + + +def _assert_no_leak(test, work, warmup, measure): + # A real leak keeps a steady RSS slope; glibc/obmalloc arena retention + # fills then flattens, sometimes with a one-off mid-run spike. Compare the + # median per-window growth of the last windows against the first so a single + # transient spike can't read as a leak — judging the shape, not a size. 
+ windows = 6 + per = max(1, measure // windows) + for _ in range(warmup): + work() + gc.collect() + prev = _rss() + growths = [] + for _ in range(windows): + for _ in range(per): + work() + gc.collect() + now = _rss() + growths.append(now - prev) + prev = now + half = max(1, windows // 2) + head = sorted(growths[:half])[half // 2] + tail = sorted(growths[-half:])[half // 2] + test.assertTrue( + tail <= 3 * 1024 * 1024 or tail * 2 <= head, + f'RSS not plateauing: per-window growth {growths} bytes over ' + f'{windows} windows of {per} iterations (head {head:.0f}, ' + f'tail {tail:.0f}); likely a leaked native buffer.') + + +@unittest.skipUnless(pd is not None, 'pandas not installed') +@unittest.skipUnless(pa is not None, 'pyarrow not installed') +@unittest.skipUnless(psutil is not None, 'psutil not installed') +class TestCategoricalArrowLeak(unittest.TestCase): + """Guards the hand-built dictionary ``ArrowArray``/``ArrowSchema`` for + pandas Categorical columns (``_dataframe_category_series_as_arrow``): + every malloc'd buffer must be freed by its ``release`` callback on both + the row path (``Buffer.dataframe`` -> ``col_t_release``) and the columnar + path (``Client.dataframe`` -> Rust import -> ``arrow_import_free``).""" + + ROWS = 4096 + + def _cat(self, n_cats, code_dtype, null_step=0, large_string=False): + codes = np.random.randint(0, n_cats, self.ROWS).astype(code_dtype) + if null_step: + codes[::null_step] = -1 + categories = [f'category_value_{i:05}' for i in range(n_cats)] + if large_string: + categories = pd.array( + categories, dtype=pd.ArrowDtype(pa.large_string())) + return pd.Series(pd.Categorical.from_codes( + codes, categories=pd.Index(categories))) + + def _frames(self): + ts = pd.Series( + pd.to_datetime(np.arange(self.ROWS), unit='s')) + v = pd.Series(np.arange(self.ROWS, dtype=np.int64)) + frames = [ + pd.DataFrame({'ts': ts, 'sym': self._cat(50, np.int8), 'v': v}), + pd.DataFrame({'ts': ts, 'sym': self._cat(50, np.int8, null_step=7), + 'v': 
v}),
+            pd.DataFrame({'ts': ts, 'sym': self._cat(300, np.int16,
+                                                     null_step=11), 'v': v}),
+        ]
+        if pa is not None:
+            frames.append(pd.DataFrame({
+                'ts': ts, 'sym': self._cat(40, np.int8, large_string=True),
+                'v': v}))
+        return frames
+
+    def _assert_stable(self, work, warmup, measure):
+        _assert_no_leak(self, work, warmup, measure)
+
+    def test_row_path_no_leak(self):
+        frames = self._frames()
+
+        def work():
+            for df in frames:
+                qi.Buffer.ilp(protocol_version=2).dataframe(
+                    df, table_name='t', at=qi.ServerTimestamp)
+
+        self._assert_stable(work, warmup=200, measure=4000)
+
+    def test_columnar_path_no_leak(self):
+        from qwp_ws_ack_server import QwpAckServer
+        frames = self._frames()
+        with QwpAckServer() as server:
+            conf = (f'qwpws::addr=127.0.0.1:{server.port};'
+                    'pool_size=1;pool_max=1;pool_reap=manual;')
+            with qi.Client.from_conf(conf) as client:
+                def work():
+                    for df in frames:
+                        client.dataframe(
+                            df, table_name='t', at='ts', symbols='auto')
+
+                self._assert_stable(work, warmup=150, measure=1800)
+
+
+@unittest.skipUnless(pd is not None, 'pandas not installed')
+@unittest.skipUnless(psutil is not None, 'psutil not installed')
+class TestPyobjColumnarLeak(unittest.TestCase):
+    """Guards the calloc'd ``pyobj_built_t`` builders
+    (``_dataframe_columnar_build_{str,int,float,bool}_pyobj``) reached by
+    ``Client.dataframe`` for object-dtype columns: every native buffer
+    (data, validity bitmap, str byte arena) must be freed on the success
+    and all-valid (bitmap-dropped) paths, and the pooled connection must be
+    returned on every call."""
+
+    ROWS = 2048
+
+    def _frames(self):
+        n = self.ROWS
+        ts = pd.Series(pd.to_datetime(np.arange(n), unit='s'))
-def get_rss():
-    return process.memory_info().rss
+        def col(values, null_step):
+            return pd.Series(
+                [None if (null_step and i % null_step == 0) else v
+                 for i, v in enumerate(values)],
+                dtype=object)
+        strs = [f'value_{i:06}' for i in range(n)]
+        ints = list(range(n))
+        floats = [i * 0.5 for i in range(n)]
+
        bools = pd.Series([bool(i & 1) for i in range(n)], dtype=object)
+        frames = []
+        for null_step in (0, 7):
+            frames.append(pd.DataFrame({
+                'ts': ts,
+                's': col(strs, null_step),
+                'i': col(ints, null_step),
+                'f': col(floats, null_step),
+                'b': bools,
+            }))
+        return frames
-def serialize_and_cleanup():
-    # qi.Buffer(protocol_version=2).row(
-    #     'table_name',
-    #     symbols={'x': 'a', 'y': 'b'},
-    #     columns={'a': 1, 'b': 2, 'c': 3})
-    df = pd.DataFrame({
-        'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
-        'b': [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
-        'c': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]})
-    qi.Buffer(protocol_version=2).dataframe(df, table_name='test', at=qi.ServerTimestamp)
+    def _assert_stable(self, work, warmup, measure):
+        _assert_no_leak(self, work, warmup, measure)
+    def test_pyobj_columnar_path_no_leak(self):
+        from qwp_ws_ack_server import QwpAckServer
+        frames = self._frames()
+        with QwpAckServer() as server:
+            conf = (f'qwpws::addr=127.0.0.1:{server.port};'
+                    'pool_size=1;pool_max=1;pool_reap=manual;')
+            with qi.Client.from_conf(conf) as client:
+                def work():
+                    for df in frames:
+                        client.dataframe(
+                            df, table_name='t', at='ts', symbols=False)
-def main():
-    warmup_count = 0
-    for n in range(1000000):
-        if n % 1000 == 0:
-            print(f'[iter: {n:09}, RSS: {get_rss():010}]')
-        if n > warmup_count:
-            before = get_rss()
-        serialize_and_cleanup()
-        if n > warmup_count:
-            after = get_rss()
-            if after != before:
-                msg = f'RSS changed from {before} to {after} after {n} iters'
-                print(msg)
+                self._assert_stable(work, warmup=150, measure=1800)
 if __name__ == '__main__':
-    main()
-
\ No newline at end of file
+    unittest.main()
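Reviewer note: the acceptance rule in `_assert_no_leak` can be checked by replaying it on synthetic per-window growth values, without measuring RSS. The following is a minimal sketch of that check. The helper name `is_plateauing`, its `flat_bytes` parameter, and the sample series are hypothetical and exist only for illustration. The thresholds (a 3 MiB tail, or a tail at most half of the head) are copied from the test in this diff. The sketch assumes the window count equals `len(growths)`.

```python
MIB = 1024 * 1024


def is_plateauing(growths, flat_bytes=3 * MIB):
    """Hypothetical helper: replay the pass condition of _assert_no_leak.

    Takes the median of the first half and of the last half of the
    per-window growth values, then accepts the run if the tail median is
    small in absolute terms or has at least halved relative to the head.
    """
    half = max(1, len(growths) // 2)
    head = sorted(growths[:half])[half // 2]
    tail = sorted(growths[-half:])[half // 2]
    return tail <= flat_bytes or tail * 2 <= head


# Allocator warm-up: large early growth that flattens out.
warmup_shape = [40 * MIB, 20 * MIB, 8 * MIB, 1 * MIB, 0, 0]
# Steady leak: roughly constant growth in every window.
leak_shape = [10 * MIB] * 6
# Flat run with a single transient spike in the second half.
spike_shape = [0, 0, 0, 0, 50 * MIB, 0]
# Still growing, but the slope has more than halved.
slowing_shape = [64 * MIB, 48 * MIB, 40 * MIB, 16 * MIB, 12 * MIB, 8 * MIB]

results = {
    'warmup': is_plateauing(warmup_shape),
    'leak': is_plateauing(leak_shape),
    'spike': is_plateauing(spike_shape),
    'slowing': is_plateauing(slowing_shape),
}
```

Using the median means one spiking window cannot fail the test on its own. A constant slope fails both halves of the condition and is reported as a leak. A leak that adds less than 3 MiB per window always satisfies the first clause, so the iteration counts passed to `measure` need to be large enough for a real leak to exceed that threshold.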