perf(pine-java): pool OperatorOutput — requires nested-map → flat list refactor first #121

Description

@Liam0205

Context

#119 (PR #120) brought a profile-driven 7-9% ns/op and 19% B/op reduction to pine-go by pooling *OperatorOutput via sync.Pool in the scheduler. That eliminated the per-Execute alloc of OperatorOutput and the dominant grow cost of the itemWrites []ItemWrite slice.

The win does not transfer to pine-java. The Java and Go runtimes are independent and share only the JSON contract and Apple DSL. #119 touched pine-go/internal/runtime/scheduler.go; nothing in that change reaches the JVM side.

Current Java state

pine-java/src/main/java/page/liam/pine/OperatorOutput.java uses:

private final Map<String, Object> commonWrites = new LinkedHashMap<>();
private final Map<Integer, Map<String, Object>> itemWrites = new HashMap<>();  // nested
private final List<Map<String, Object>> addedItems = new ArrayList<>();
private final Set<Integer> removedItems = new HashSet<>();

The blocker is structural, not lifetime-related:

| Storage | Per-Execute behaviour | Pooling viability |
| --- | --- | --- |
| itemWrites: Map<Integer, Map<String,Object>> (nested) | Each setItem(i, field, v) does computeIfAbsent(i, k -> new LinkedHashMap<>()).put(field, v): a map-node and inner LinkedHashMap alloc on first touch per row, then a per-cell map put | Pooling the outer map keeps O(N) inner LinkedHashMaps live (N rows, M fields per row); clearing them on Reset is itself O(N×M), worse than the current alloc |
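The trade-off in the table can be sketched in a few lines. This is a minimal stand-in, not the real OperatorOutput: the class name NestedItemWrites and both reset methods are hypothetical, and only the setItem body mirrors the pattern quoted above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the itemWrites part of OperatorOutput.
final class NestedItemWrites {
    private final Map<Integer, Map<String, Object>> itemWrites = new HashMap<>();

    void setItem(int index, String field, Object value) {
        // First touch of a row allocates an inner LinkedHashMap and an
        // outer-map entry; every later write to that row is a map put.
        itemWrites.computeIfAbsent(index, k -> new LinkedHashMap<>()).put(field, value);
    }

    int innerMapCount() {
        return itemWrites.size();
    }

    // Option A: drop everything. The inner maps become garbage, so a pooled
    // instance reuses only the outer table and saves very little.
    void resetDropping() {
        itemWrites.clear();
    }

    // Option B: keep the inner maps for reuse. Every inner map has to be
    // visited and cleared, and rows that get no writes next time still
    // appear as (empty) entries in the outer map.
    void resetRetaining() {
        for (Map<String, Object> row : itemWrites.values()) {
            row.clear();
        }
    }
}
```

Neither reset is attractive: option A gives up the reuse that pooling is meant to provide, and option B pays a per-row clearing cost and changes which row keys are present.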

pine-go had the same nested form historically (commit d238098 "replace nested map item writes with flat []ItemWrite slice", v0.7 era). The refactor unlocked everything downstream — including #119's pool. Java needs the same structural refactor first.

Suggested phasing

Phase 1 (refactor — separate PR, no pooling yet):

  • Introduce ItemWrite { int index; String field; Object value; } record
  • Change itemWrites to List<ItemWrite> (or ArrayList<ItemWrite> for capacity reuse later)
  • Update Engine.applyOutput / ColumnFrame.applyOutput / DataFrame.applyOutput / ParallelExecutor.mergeOutputs to iterate the flat list
  • Update Java fuzz / unit tests as needed; cross-validate's byte-equal /execute parity (scripts/cross-validate/02-engine-byte-exact.sh) gates correctness — no behaviour change should leak
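A minimal sketch of what Phase 1 could look like, assuming the record shape proposed above. FlatItemWrites and applyTo are hypothetical names; applyTo only stands in for the real applyOutput / mergeOutputs call sites, whose signatures are not shown in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Field names follow the shape proposed in this issue.
record ItemWrite(int index, String field, Object value) {}

// Hypothetical flat-list replacement for the nested itemWrites map.
final class FlatItemWrites {
    private final ArrayList<ItemWrite> itemWrites = new ArrayList<>();

    void setItem(int index, String field, Object value) {
        // One small record per write; no per-row map is created.
        itemWrites.add(new ItemWrite(index, field, value));
    }

    List<ItemWrite> itemWrites() {
        return itemWrites;
    }

    // Hypothetical consumer: applies writes in call order, so the last
    // write to a given (index, field) cell wins.
    static void applyTo(List<Map<String, Object>> rows, List<ItemWrite> writes) {
        for (ItemWrite w : writes) {
            rows.get(w.index()).put(w.field(), w.value());
        }
    }
}
```

Two things worth checking during the refactor: the flat list is applied in call order rather than grouped by row, so any consumer that depends on per-row grouping needs a second look, and unlike Go's slice of structs each Java ItemWrite is still a separate heap object.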

Phase 2 (pool — once Phase 1 lands):

  • reset() method analogous to Go's Reset: null slot refs, truncate to size 0 (ArrayList.clear() does both and retains capacity)
  • ThreadLocal<OperatorOutput> or ConcurrentLinkedDeque pool keyed at Engine instance
  • Expect 5-10% throughput improvement based on Go numbers + JVM GC overhead profile
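The ThreadLocal variant of Phase 2 might look roughly like this. PooledOutput and OutputPool are hypothetical names, only the itemWrites field is modelled, and the pool is an instance field so that an Engine could own one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-Phase-1 output holding only the flat write list.
final class PooledOutput {
    record ItemWrite(int index, String field, Object value) {}

    private final ArrayList<ItemWrite> itemWrites = new ArrayList<>();

    void setItem(int index, String field, Object value) {
        itemWrites.add(new ItemWrite(index, field, value));
    }

    List<ItemWrite> itemWrites() {
        return itemWrites;
    }

    // ArrayList.clear() nulls every slot and sets size to 0 but keeps the
    // backing array, so the next Execute appends without regrowing.
    void reset() {
        itemWrites.clear();
    }
}

// Hypothetical per-Engine pool: one reusable output per calling thread.
final class OutputPool {
    private final ThreadLocal<PooledOutput> local =
            ThreadLocal.withInitial(PooledOutput::new);

    // Returns the calling thread's instance, reset and ready for writes.
    PooledOutput acquire() {
        PooledOutput out = local.get();
        out.reset();
        return out;
    }
}
```

This shape is only safe if an output is fully consumed before the same thread calls acquire() again and is never handed to another thread. If a caller such as ParallelExecutor.mergeOutputs needs several outputs alive at once, an explicit acquire/release pool (for example the ConcurrentLinkedDeque option) would be the better fit.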

Why JVM-specific concerns matter

JVM is not Go:

  • Short-lived objects often live in TLAB (Thread-Local Allocation Buffer), young gen, never reach old gen
  • Escape analysis can sometimes stack-allocate
  • BUT: nested LinkedHashMap allocs may be heavy enough on hot paths to churn through TLABs and raise young-gen GC pressure; profiling is required to confirm the win

Recommended approach: profile pine-java first via JMH (which #119 already noted is missing — see pine-java/benchmarks/'s placeholder note). If OperatorOutput.setItem map allocs dominate hot-path GC pressure, do Phase 1+2. If JVM is amortizing it away cheaply, defer indefinitely.
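Until a JMH harness exists, a coarse first look at B/op is possible with the HotSpot-specific com.sun.management.ThreadMXBean. The sketch below is not a substitute for JMH: it measures allocated bytes only, with no warmup or timing, and AllocProbe, nestedBytes and flatBytes are hypothetical names that imitate the two storage shapes rather than calling the real OperatorOutput.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AllocProbe {
    record Write(int index, String field, Object value) {}

    // HotSpot extension of the standard ThreadMXBean; assumed available.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Results are published here so the JIT cannot optimize the allocations away.
    static Object sink;

    // Bytes allocated by the current thread while body runs.
    static long allocatedBytes(Runnable body) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        body.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    // Imitates the current nested-map write pattern.
    static long nestedBytes(int rows, String[] fields) {
        return allocatedBytes(() -> {
            Map<Integer, Map<String, Object>> writes = new HashMap<>();
            for (int i = 0; i < rows; i++) {
                for (String f : fields) {
                    writes.computeIfAbsent(i, k -> new LinkedHashMap<>()).put(f, Boolean.TRUE);
                }
            }
            sink = writes;
        });
    }

    // Imitates the proposed flat-list write pattern.
    static long flatBytes(int rows, String[] fields) {
        return allocatedBytes(() -> {
            List<Write> writes = new ArrayList<>();
            for (int i = 0; i < rows; i++) {
                for (String f : fields) {
                    writes.add(new Write(i, f, Boolean.TRUE));
                }
            }
            sink = writes;
        });
    }
}
```

Comparing nestedBytes and flatBytes for a representative row and field count gives a rough idea of how much allocation the refactor could remove. Whether that translates into throughput still needs a proper JMH run.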

Risk

  • Behaviour parity: byte-equal /execute parity is gated by scripts/cross-validate/02-engine-byte-exact.sh. Both phases must preserve it.
  • Concurrency: Java's OperatorOutput is currently mutable-by-single-thread per Execute; the same contract must hold post-refactor.
  • Phase 1 LOC: ~80-120 lines (Java reformat of the nested-map walk is the bulk).
