Memory-aware C exporter rewrite (no gRPC/Arrow-C++; OTLP/HTTP + nanoarrow) #108

Draft
iskakaushik wants to merge 11 commits into main from otel-rewrite

@iskakaushik iskakaushik commented Jun 10, 2026

Collaborator

Summary

From-scratch C rewrite of the export path (the extension was previously C++), motivated by a production SIGABRT under memory exhaustion that forced 49s of database-wide crash recovery. gRPC's gpr_malloc called abort() when malloc returned NULL under vm.overcommit_memory=2, and a shmem-attached bgworker dying by signal forces the postmaster to crash-recover the whole cluster.

The fix removes every abort()-reachable path from the process and makes telemetry-stack failure cost at most a bgworker restart. Design and rationale: OTEL_REWRITE_DESIGN.md.

What changed

  • Whole extension is now C: project(LANGUAGES C); any .cc under src/ is a configure-time error. Drops opentelemetry-cpp/gRPC/protobuf/abseil/Arrow-C++; keeps openssl/lz4/zstd.
  • OTLP/HTTP exporter (hand-rolled HTTP/1.1, protobuf pinned to opentelemetry-proto v1.9.0) replaces the gRPC client — same wire payloads, no gRPC threads, no gpr_malloc.
  • nanoarrow + hand-built IPC replaces Arrow C++ — byte-compatible Arrow IPC (56-field schema, dictionary batches, ZSTD BUFFER framing), validated against pyarrow.
  • Preallocated, zero-allocation hot path — all export buffers (staging, encode, network, Arrow scratch, ZSTD ctx, flatcc emitter pages) reserved at worker start. Under strict overcommit the steady-state export can't even reach an allocation failure.
  • SIGABRT backstop — async-signal-safe _exit(1), so any residual abort (e.g. a vendored-lib assert) costs a bgworker restart, not cluster crash recovery. SIGSEGV/SIGBUS keep full crash semantics.
  • GUC consolidation — 8 interacting memory knobs → pg_stat_ch.memory_limit (default 160 MB = old-default equivalence point) + three -1=auto expert overrides; pg_stat_ch_memory() view; control 0.3 → 0.4.
  • Failure semantics — peek/commit two-phase dequeue so a collector outage no longer silently drops events; every failure feeds backoff (kills the prior 2/sec log flood); poison-batch valve; export_dropped counter.
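
As an illustration of the SIGABRT backstop item above, here is a minimal POSIX sketch, assuming a handler installed once at worker start. The names AbortBackstop and InstallAbortBackstop and the message text are hypothetical and are not taken from bgworker.c; the only property carried over from the description is that the handler restricts itself to async-signal-safe calls (write, _exit).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical handler name. Only async-signal-safe calls are used here. */
static void AbortBackstop(int signo) {
  static const char msg[] = "export worker: abort intercepted, exiting\n";
  ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)rc;    /* nothing useful can be done if the write fails */
  (void)signo;
  _exit(1);    /* exit with a plain status instead of dying by signal */
}

/* Returns 0 on success, -1 on failure (errno is set by sigaction). */
static int InstallAbortBackstop(void) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = AbortBackstop;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGABRT, &sa, NULL);
}
```

Because the handler never returns, a stray abort() from a vendored library ends the process with exit status 1 rather than with a signal. SIGSEGV and SIGBUS are deliberately left alone in this sketch, matching the description.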

Verification

  • All 19 TUs compile clean under the full strict warning set (incl. new -Wstrict-prototypes/-Wmissing-prototypes); extension links.
  • Abort-surface audit: zero C++ runtime symbols; no abort()/assert() reachable from extension code or from nanoarrow/flatcc (built with NDEBUG + FLATCC_NO_ASSERT); the only _exit call is the SIGABRT backstop.
  • otlp_encode unit test: 541 checks, clean under ASan/UBSan. arrow_batch: payload decoded by pyarrow to confirm byte compatibility.
  • Adversarial multi-lens code review (abort/longjmp, memory-prealloc, failure-semantics, wire-compat, GUC/SQL/build), each finding independently verified: 14 confirmed, 13 fixed in this branch.

Known follow-ups

  • TAP tests t/015_guc_validation and t/023_drain_loop encode pre-consolidation GUC semantics and need updating; README GUC docs still describe old knobs.
  • One LOW review finding deferred: pg_stat_ch_memory() reports export_dropped (a count) in the budget_bytes column — a documented single-shape compromise; fixing it would ripple the SQL signature.
  • Local build/verification was on macOS (direct compile+link); the Linux vcpkg/CMake matrix runs here in CI.

Captures the design driving the C++->C exporter rewrite: the memory-aware
OTLP/HTTP + nanoarrow architecture, the preallocated zero-allocation memory
model, the OTLP/Arrow wire contracts, the failure semantics (peek/commit
two-phase dequeue, backoff, poison-batch valve), the memory_limit GUC
consolidation, and the SIGABRT backstop.  Motivated by a production SIGABRT
where a gRPC allocation aborted under strict overcommit and forced
database-wide crash recovery.

Checked-in amalgamation (not a submodule) generated with
bundle.py --with-ipc --with-flatcc --symbol-namespace PgStatCh, pinned in
third_party/nanoarrow/VERSION.  Provides Arrow array building plus the flatcc
runtime for hand-built IPC messages.  Built with NDEBUG + FLATCC_NO_ASSERT so
every failure path returns an error code instead of abort()/assert() — the
property the exporter rewrite depends on.

Pure clang-format reflow (K&R/PostgreSQL brace and wrapping style -> the
project's Google-derived .clang-format).  No logic changes.  These files were
never format-enforced before because the mise/CI globs only matched .cc.

Replace the C++ virtual exporter_interface.h with exporter.h: a function-pointer
ops table (connect/export_events/send_arrow/...), the PschExportStatus contract,
and the preallocated export-arena split shared by driver and backends.
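
A function-pointer ops table of the kind described above might look like the following sketch. Everything here is illustrative: the type names (ExportStatus, ExporterOps, Exporter), the toy counting backend, and the reduced two-entry table are assumptions, not the contents of exporter.h.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for the PschExportStatus contract. */
typedef enum { EXPORT_OK = 0, EXPORT_ERR_CONN, EXPORT_ERR_SEND } ExportStatus;

/* One table per backend; the driver only calls through these pointers. */
typedef struct ExporterOps {
  ExportStatus (*connect)(void* state);
  ExportStatus (*export_events)(void* state, const int* events, size_t n);
} ExporterOps;

typedef struct Exporter {
  const ExporterOps* ops; /* which backend */
  void* state;            /* backend-private state */
} Exporter;

/* Toy backend used only for illustration: it counts exported events. */
typedef struct CountingState {
  int connected;
  size_t exported;
} CountingState;

static ExportStatus CountingConnect(void* state) {
  ((CountingState*)state)->connected = 1;
  return EXPORT_OK;
}

static ExportStatus CountingExport(void* state, const int* events, size_t n) {
  CountingState* s = state;
  (void)events;
  if (!s->connected) return EXPORT_ERR_CONN;
  s->exported += n;
  return EXPORT_OK;
}

static const ExporterOps kCountingOps = {CountingConnect, CountingExport};

/* Driver-side call site: no knowledge of the concrete backend. */
static ExportStatus ExporterSend(Exporter* e, const int* events, size_t n) {
  return e->ops->export_events(e->state, events, n);
}
```

The table is const and statically initialized, so selecting a backend is a pointer assignment and needs no constructor or heap allocation.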

Add otlp_encode.{h,c}: a hand-rolled, zero-allocation protobuf wire encoder for
OTLP logs pinned to opentelemetry-proto v1.9.0.  Single-pass nested messages via
fixed-width overlong varint length slots; overflow-safe (sticky flag, no partial
out-of-bounds writes); bounds-checked response/Status parsers.  PschPbMsgEnd flags
overflow rather than Assert()ing on an unbalanced slot, so an encoder bug cannot
SIGABRT the bgworker.
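
The single-pass nesting technique can be sketched as follows, assuming a 4-byte length slot (enough for bodies below 2^28 bytes) and relying on decoders accepting non-minimal varints. The names PbBuf, PbMsgBegin, PbMsgEnd and PbString are hypothetical and the slot width is a guess; the real otlp_encode API may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PB_LEN_SLOT = 4 }; /* assumed slot width: 4 varint bytes, 28 bits */

typedef struct {
  uint8_t* buf;
  size_t cap;
  size_t len;
  int overflow; /* sticky: once set, every later write is a no-op */
} PbBuf;

/* Writes all n bytes or none, so nothing lands past the end of buf. */
static void PbPut(PbBuf* b, const void* src, size_t n) {
  if (b->overflow || n > b->cap - b->len) {
    b->overflow = 1;
    return;
  }
  memcpy(b->buf + b->len, src, n);
  b->len += n;
}

static void PbVarint(PbBuf* b, uint64_t v) {
  uint8_t tmp[10];
  size_t n = 0;
  while (v >= 0x80) {
    tmp[n++] = (uint8_t)(v | 0x80);
    v >>= 7;
  }
  tmp[n++] = (uint8_t)v;
  PbPut(b, tmp, n);
}

/* Opens a length-delimited field and reserves its length slot. */
static size_t PbMsgBegin(PbBuf* b, uint32_t field) {
  static const uint8_t zeros[PB_LEN_SLOT] = {0};
  size_t slot;
  PbVarint(b, ((uint64_t)field << 3) | 2); /* wire type 2 = LEN */
  slot = b->len;
  PbPut(b, zeros, PB_LEN_SLOT);
  return slot;
}

/* Back-patches the slot with a fixed-width (possibly overlong) varint. */
static void PbMsgEnd(PbBuf* b, size_t slot) {
  size_t body;
  if (b->overflow) return;
  if (slot > b->len || b->len - slot < PB_LEN_SLOT) {
    b->overflow = 1; /* unbalanced slot: flag it, never assert */
    return;
  }
  body = b->len - slot - PB_LEN_SLOT;
  if (body >= ((size_t)1 << 28)) {
    b->overflow = 1;
    return;
  }
  for (int i = 0; i < PB_LEN_SLOT; i++) {
    uint8_t byte = (uint8_t)((body >> (7 * i)) & 0x7F);
    if (i < PB_LEN_SLOT - 1) byte |= 0x80; /* continuation bit */
    b->buf[slot + i] = byte;
  }
}

static void PbString(PbBuf* b, uint32_t field, const char* s, size_t n) {
  PbVarint(b, ((uint64_t)field << 3) | 2);
  PbVarint(b, n);
  PbPut(b, s, n);
}
```

Reserving a fixed-width slot means the body never has to be measured in a first pass or shifted afterwards; the cost is up to three padding bytes per nested message. The caller checks the overflow flag once at the end instead of after every write.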
Replace the Arrow C++ ArrowBatchBuilder with a C builder producing byte-compatible
Arrow IPC streaming payloads (56-field schema, 13 dictionary batches, ZSTD BUFFER
compression, validated against pyarrow).  nanoarrow's encoder cannot emit
dictionary batches or compressed buffers, so arrow_ipc_emit.c hand-builds the
Message/Schema/RecordBatch/DictionaryBatch/BodyCompression flatbuffers via the
flatcc runtime.

All buffers, the ZSTD context, and dictionary memo tables are preallocated;
steady-state Append/Finish/Reset perform zero heap allocation.  flatcc_emitter_alloc
routes flatcc's emitter pages through a fixed pre-reserved pool so flatcc's
page-shrink heuristic cannot malloc/free on the hot path (force-included into
flatcc.c via CMake).  Length-clamp WARNINGs are rate-limited to 1/sec.
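
The idea of serving emitter pages from a fixed, pre-reserved pool can be sketched generically as below. This is not flatcc's allocation hook: PagePool, PoolAlloc, PoolFree, the page count and the page size are all hypothetical, chosen only to show a pool that hands out and reclaims pages without touching malloc/free.

```c
#include <stddef.h>

#define POOL_PAGES 4       /* hypothetical page count */
#define EMIT_PAGE_SIZE 256 /* hypothetical page size in bytes */

typedef struct {
  _Alignas(max_align_t) unsigned char pages[POOL_PAGES][EMIT_PAGE_SIZE];
  unsigned char in_use[POOL_PAGES];
} PagePool;

/* Returns a free page, or NULL when the pool is exhausted (never aborts). */
static void* PoolAlloc(PagePool* p) {
  for (int i = 0; i < POOL_PAGES; i++) {
    if (!p->in_use[i]) {
      p->in_use[i] = 1;
      return p->pages[i];
    }
  }
  return NULL;
}

/* Marks a page reusable; pointers that are not pool pages are ignored. */
static void PoolFree(PagePool* p, void* page) {
  for (int i = 0; i < POOL_PAGES; i++) {
    if (page == (void*)p->pages[i]) {
      p->in_use[i] = 0;
      return;
    }
  }
}
```

A pool like this turns the emitter's grow/shrink cycle into flag flips on memory reserved at worker start, and exhaustion surfaces as a NULL return that the caller can map to an error status.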
Replace the gRPC direct-proto exporter with a hand-rolled HTTP/1.1 client over
blocking sockets on the bgworker thread (OpenSSL when the endpoint is https),
emitting the identical OTLP ExportLogsServiceRequest payloads for both the
per-record and Arrow-passthrough paths.  Encode/network buffers and the constant
request-head prefix are preallocated; reconnects may allocate, but nothing allocates per batch.

Removes gRPC entirely — and with it the gpr_malloc abort site and the background
gRPC threads in the shmem-attached bgworker.  Retry only on 429/502/503/504 and
connection drop (Retry-After honored); other 4xx/5xx are permanent.  otel_headers
values are rejected if they contain control characters (header-injection guard);
db.name/db.user clamp to NAMEDATALEN-1, matching the other paths.
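
The retry policy and the header guard above can be sketched in a few lines. The function and enum names are hypothetical. Two details are assumptions on top of the commit message: any non-2xx status outside the retryable set is treated as permanent, and "control character" is read as any byte below 0x20 or equal to 0x7F (which also rejects horizontal tab).

```c
#include <stdbool.h>

typedef enum { HTTP_DONE, HTTP_RETRY, HTTP_PERMANENT } HttpOutcome;

/* Retry only on 429/502/503/504; every other failure is permanent. */
static HttpOutcome ClassifyStatus(int status) {
  if (status >= 200 && status < 300) return HTTP_DONE;
  switch (status) {
    case 429:
    case 502:
    case 503:
    case 504:
      return HTTP_RETRY;
    default:
      return HTTP_PERMANENT;
  }
}

/* Header-injection guard: reject values containing control bytes such as
 * CR or LF, which could otherwise smuggle extra header lines. */
static bool HeaderValueIsSafe(const char* v) {
  for (const unsigned char* p = (const unsigned char*)v; *p != '\0'; p++) {
    if (*p < 0x20 || *p == 0x7F) return false;
  }
  return true;
}
```

A connection drop is not an HTTP status, so it would be classified by the socket layer before this function is reached; Retry-After handling is likewise outside this sketch.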
C port of the clickhouse_exporter (clickhouse-c is already C underneath).  Static
column descriptor table replaces the C++ column-factory/registry; goto-cleanup
replaces the RAII guards; preallocated per-column buffers replace std::vector.
An in_flight flag forces a reconnect after a longjmp interrupts a wire exchange.
The cancel callback now also fires on ProcSignalBarrierPending so a DROP DATABASE
barrier is not blocked behind a flowing ClickHouse read.  Backend failure stats
are recorded by the driver, not here (avoids double-counting).

…ackstop

Port the export driver to C and harden the failure path (OTEL_REWRITE_DESIGN.md
section 5a):

- shmem.c gains two-phase consume: PschPeekEvents resolves DSA strings without
  freeing or advancing the tail; PschConsumeEvents frees and advances.  Slot
  dsa_pointers are cleared BEFORE freeing so a longjmp out of dsa_free leaks
  rather than double-frees on retry.
- Failure routing: ERR_CONN never consumes (events survive a collector outage);
  ERR_SEND requeues until a poison threshold; ERR_NOMEM/ERR_INTERNAL drop.  A
  mid-chunk Arrow failure counts already-delivered events as exported and only
  drops/requeues the undelivered remainder.  New export_dropped counter, distinct
  from enqueue overflow.  Every failure mode increments the backoff counter.
- bgworker.c installs a SIGABRT backstop (async-signal-safe write + _exit(1))
  so a residual abort costs a bgworker restart, not database-wide crash recovery;
  SIGSEGV/SIGBUS keep full crash semantics.  Drain loop is bounded per cycle so
  procsignal barriers are processed promptly.  Ring-sanity check at worker start.
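
The two-phase consume and the failure routing can be pictured with a single-threaded toy ring. This is a sketch under loud assumptions: the real queue lives in shared memory with DSA strings and synchronization, none of which is modeled, and every name here (Ring, Driver, DrainOnce, POISON_THRESHOLD and its value of 3) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_CAP 8u        /* power of two, so masking replaces modulo */
#define POISON_THRESHOLD 3 /* hypothetical: give up after 3 send failures */

typedef struct {
  int slots[RING_CAP];
  uint32_t head; /* next write position */
  uint32_t tail; /* next read position */
} Ring;

typedef enum { ST_OK, ST_ERR_CONN, ST_ERR_SEND, ST_ERR_NOMEM } Status;
typedef Status (*SendFn)(const int* events, size_t n);

typedef struct {
  Ring ring;
  unsigned send_failures;
  unsigned long exported;
  unsigned long dropped;
} Driver;

static int RingPush(Ring* r, int v) {
  if (r->head - r->tail == RING_CAP) return 0; /* full */
  r->slots[r->head & (RING_CAP - 1)] = v;
  r->head++;
  return 1;
}

/* Phase 1: copy out up to max events without advancing the tail. */
static size_t RingPeek(const Ring* r, int* out, size_t max) {
  uint32_t avail = r->head - r->tail;
  size_t n = avail < max ? avail : max;
  for (size_t i = 0; i < n; i++)
    out[i] = r->slots[(r->tail + (uint32_t)i) & (RING_CAP - 1)];
  return n;
}

/* Phase 2: commit by advancing the tail past n events. */
static void RingConsume(Ring* r, size_t n) {
  uint32_t avail = r->head - r->tail;
  if (n > avail) n = avail;
  r->tail += (uint32_t)n;
}

static void DrainOnce(Driver* d, SendFn send) {
  int batch[4];
  size_t n = RingPeek(&d->ring, batch, 4);
  if (n == 0) return;
  switch (send(batch, n)) {
    case ST_OK:
      RingConsume(&d->ring, n);
      d->exported += n;
      d->send_failures = 0;
      break;
    case ST_ERR_CONN: /* never consume: events survive an outage */
      break;
    case ST_ERR_SEND: /* leave in place until the poison threshold */
      if (++d->send_failures >= POISON_THRESHOLD) {
        RingConsume(&d->ring, n);
        d->dropped += n;
        d->send_failures = 0;
      }
      break;
    case ST_ERR_NOMEM: /* drop immediately */
      RingConsume(&d->ring, n);
      d->dropped += n;
      break;
  }
}

/* Toy senders for experimentation. */
static Status SendOk(const int* e, size_t n) { (void)e; (void)n; return ST_OK; }
static Status SendConnDown(const int* e, size_t n) { (void)e; (void)n; return ST_ERR_CONN; }
static Status SendRejected(const int* e, size_t n) { (void)e; (void)n; return ST_ERR_SEND; }
```

The point of the split is that a failed send leaves the tail where it was, so the same events are peeked again on the next cycle; only a successful delivery or an explicit drop decision advances it. Backoff between cycles is left out of this sketch.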
Collapse 8 interacting memory knobs into one operator knob plus three
-1=auto expert overrides (OTEL_REWRITE_DESIGN.md section 6):

- pg_stat_ch.memory_limit (default 160 MB = verified equivalence point of the
  old defaults; -1 = opt-in auto from shared_buffers; min 32 MB so a configured
  value is honored rather than always auto-raised).
- queue_capacity/string_area_size/export_buffer_size default -1=auto; explicit
  queue_capacity is rounded up to a power of 2 and floored at the ring minimum.
- Overrides auto-raise the budget with a WARNING, never FATAL; resolved values
  are written back so SHOW reports effective sizes.  The intern HTAB is charged
  to the budget.  Deleted dead shims; legacy knobs (batch_max, otel_*) become
  hidden one-release compat bridges; otel_log_delay_ms -> export_timeout (1000 ms).
- pg_stat_ch_memory() view exposes per-component budget/source; control bumped
  to 0.4 with a migration; guc.out updated.
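
The queue_capacity rule (-1 means auto, explicit values rounded up to a power of 2 and floored at the ring minimum) can be sketched like this. RING_MIN_CAPACITY and its value, and the function names, are hypothetical; the actual minimum and the auto-sizing formula are not stated in the commit message.

```c
#include <stdint.h>

#define RING_MIN_CAPACITY 1024u /* hypothetical ring minimum */

/* Rounds v up to the next power of two (values above 2^31 are clamped). */
static uint32_t RoundUpPow2(uint32_t v) {
  if (v <= 1) return 1;
  if (v > 0x80000000u) return 0x80000000u;
  v--;
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  return v + 1;
}

/* requested < 0 means "auto"; auto_value is whatever the budget derived. */
static uint32_t ResolveQueueCapacity(int32_t requested, uint32_t auto_value) {
  uint32_t cap;
  if (requested < 0) return auto_value;
  cap = RoundUpPow2((uint32_t)requested);
  return cap < RING_MIN_CAPACITY ? RING_MIN_CAPACITY : cap;
}
```

Writing the resolved value back into the GUC, as the commit describes, is what lets SHOW report 4096 after an operator sets 3000.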
project(LANGUAGES C); any .cc/.cxx/.cpp under src/ is now a configure-time
FATAL_ERROR.  Drop opentelemetry-cpp/Arrow (and transitively gRPC/protobuf/
abseil) from vcpkg.json, leaving openssl/lz4/zstd; vcpkg is kept (3 deps) for
static, pinned release artifacts.  Remove the cxx_std_17 / -include libintl.h /
-Wglobal-constructors C++ scaffolding; pin C_STANDARD 17 + C_EXTENSIONS.
.clang-tidy drops google-*/modernize-* and adds bugprone-*; mise/CI format and
lint globs now cover .c.  CI drops g++/CXX settings.  Update CLAUDE.md and
README for the C/dependency story; retire the cpp-* skills.

otlp_encode_test.c: 541 known-answer/overflow/parser checks, clean under
ASan/UBSan.  arrow_batch_test.sh: builds the IPC payload for synthetic events
and validates schema + decoded values against pyarrow (the byte-compat oracle).

Comment thread src/export/arrow_batch.c
return (int)max;
}

static uint32 HashBytes(const char* s, int len) {

Member

This could use PostgreSQL's own hash function, hash_bytes.

@JoshDreamland

Contributor

Will investigate pulling this into the new unified exporter arm.

@serprex

serprex commented Jun 18, 2026

Member

> 14 confirmed, 13 fixed in this branch.

so what's the one that's not fixed?

@JoshDreamland

Contributor

Holy never mind. NanoArrow brings its own baggage, and I misunderstood how massive this PR is. I think we should just secure our arena interactions and then convert all aborts that do not poison the arena into plain exits. I'll investigate that afterward. Can you get Claude to make those 14 crashes accessible somewhere? They're worth addressing without... all this at once.

@serprex

serprex commented Jun 18, 2026

Member

agreed, nanoarrow is why I abandoned my own foray into porting pg_stat_ch to C

3 participants