Performance improvements #69

Merged: SimonDanisch merged 7 commits into master from perf/write-be on Jun 11, 2026

SimonDanisch (Member) commented Jun 11, 2026

Faster packing/unpacking, less inference

Packing showed up in Bonito profiles mainly as allocations: every multi-byte
write (ints, floats, length prefixes) allocates a Ref inside stdlib's
write(io, ::Real), which escape analysis can't elide before 1.12. So:

  • new write_be helper that writes primitives to IOBuffers without the Ref.
    Packing 1000 Ints into a reused buffer goes from 745 allocs to 0, realistic
    messages lose 16-55% of their allocations (numbers in PERF.md)
  • pack(x; sizehint=64) to pre-size the output buffer
  • unpack was compile-heavy: _unpack_any now uses @nospecializeinfer
    (~50ms less inference, ~15% faster for Any payloads since callers stop
    union splitting over the return type), the struct decoder generates its
    field matching per type instead of walking a fixed 33-way @nif, and
    array/map lengths share one Int specialization instead of one per
    UInt8/16/32 prefix

No changes to the format or unpack behavior, and there are new tests for
exactly that (bytes from the spec, same output on all IO paths and
sizehints, struct decoding edge cases like unknown/duplicate/missing keys).

While at it I updated the CI to the PkgTemplates standard - windows hasn't
been tested since the appveyor days, and the docs build has been broken for
years (pre-rewrite UUID in docs/Project.toml plus Documenter 0.22). Julia
compat is 1.10 now because of @nospecializeinfer, version goes to 1.3.0.

`write(io::IO, x::Real)` for non-byte primitives goes through
`write(io, Ref(x), sizeof(x))`, and the Ref allocation isn't elided by
escape analysis (≤Julia 1.12), so every msgpack integer/float wider than
a byte costs one heap allocation. `write_be` inlines that path with a
local Ref + GC.@preserve + unsafe_copyto! into the IOBuffer's data, which
the compiler can SROA — zero allocations per primitive write.
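
As a rough illustration of the pattern described above (a local Ref, GC.@preserve, and unsafe_copyto!), here is a minimal sketch that appends big-endian integers to a plain Vector{UInt8}. The names `write_be_sketch!` and `WideInt` are hypothetical, and the real `write_be` targets an IOBuffer's data rather than a bare vector, so treat this as an approximation of the idea, not the PR's code.

```julia
# Illustrative sketch only; `write_be_sketch!` and `WideInt` are hypothetical
# names and this is not MsgPack.jl's actual `write_be`.
const WideInt = Union{Int16,Int32,Int64,UInt16,UInt32,UInt64}

function write_be_sketch!(buf::Vector{UInt8}, x::T) where {T<:WideInt}
    n = sizeof(T)
    pos = length(buf)
    resize!(buf, pos + n)              # make room for the new bytes
    r = Ref(hton(x))                   # host order -> big-endian (network) order
    GC.@preserve buf r begin
        # Reinterpret the Ref's storage as raw bytes and copy them into buf.
        src = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{T}, r))
        unsafe_copyto!(pointer(buf, pos + 1), src, n)
    end
    return nothing                     # side-effect-only, like pack_format
end

buf = UInt8[]
write_be_sketch!(buf, 0x0102)          # UInt16
write_be_sketch!(buf, Int32(1))
```

Because the Ref never leaves the function, the compiler has a chance to keep it off the heap; whether it actually does depends on the Julia version, so allocation counts should be measured rather than assumed.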

Also expose `sizehint` as a kwarg on `pack(x)` so callers that know the
output will be large can skip the geometric resize sequence on first use.
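
The effect of a size hint can be sketched with Base's own `IOBuffer(; sizehint=...)` keyword. `pack_sketch` below is a hypothetical stand-in for `pack`, assuming the kwarg is simply forwarded to the buffer; it only shows that the hint changes initial capacity, not the bytes produced.

```julia
# Hypothetical stand-in for `pack(x; sizehint=64)`; not MsgPack.jl code.
function pack_sketch(values::Vector{UInt16}; sizehint::Integer = 64)
    io = IOBuffer(; sizehint = sizehint)   # pre-size the backing storage
    for v in values
        write(io, hton(v))                 # big-endian payload bytes
    end
    return take!(io)
end

small = pack_sketch(UInt16[1, 2]; sizehint = 1)
large = pack_sketch(UInt16[1, 2]; sizehint = 4096)
```

Both calls should return the same bytes; only the number of internal resizes differs.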

Idiomatic cleanup: `pack_format` and `_pack_integer` now `return nothing`
explicitly instead of leaking the `Int` byte-count from `write` up
through `pack_type` chains. No perf effect (the value was already being
discarded) — just clearer contract for side-effect-only methods.

PERF.md records the bench results: 1000 small Ints into a reused
IOBuffer went 745 allocs → 0; protocol-level RTT median in
bench/bench.jl improved ~5%, p99 latency ~9% on top of the existing
Bonito-side optimisations. Full numbers and methodology in the doc.

`_unpack_array(io, n, T, strict)` and `_unpack_map(io, n, T, strict)` were
getting compiled into 3 separate specializations per `T` — one each for
n::UInt8, n::UInt16, n::UInt32 — because the size prefix coming out of
unpack_format was its native width type. The bodies are structurally
identical (n is just a counter); the fan-out was pure waste.

Normalize n to Int at the call sites (Array16Format / Array32Format /
ArrayFixFormat and Map16/32/Fix). Result:

- 3 width specs per T → 1 spec per T (per array/map)
- Cold MsgPack.unpack(bytes) inference: 261ms → 165ms (-37%)
- ROOT total inference for one cold unpack: 7.34s → 5.99s (-1.35s, -18%)
  (the compiler avoids walking the redundant specialization branches)

No runtime cost — for-loop bounds want Int anyway, the conversion is free.
Roundtrip and 1048 MsgPack tests pass.
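
A minimal sketch of the call-site normalization, assuming a 16-bit length prefix followed by big-endian elements. `unpack_array_sketch` and `read_array16_sketch` are hypothetical names, not the package's internals.

```julia
# Hypothetical sketch: the worker takes n::Int, so it is compiled once per T
# instead of once per (T, prefix width) pair.
function unpack_array_sketch(io::IO, n::Int, ::Type{T}) where {T}
    out = Vector{T}(undef, n)
    for i in 1:n
        out[i] = ntoh(read(io, T))     # elements are stored big-endian
    end
    return out
end

function read_array16_sketch(io::IO, ::Type{T}) where {T}
    n = ntoh(read(io, UInt16))         # native-width length prefix
    return unpack_array_sketch(io, Int(n), T)   # widen at the call site
end

io = IOBuffer(UInt8[0x00, 0x02, 0x00, 0x01, 0x00, 0x02])
decoded = read_array16_sketch(io, UInt16)
```

A reader for a 32-bit prefix would convert in the same way and land on the same `unpack_array_sketch` specialization.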

Most relevant for hot-path consumers like Bonito that unpack many
Dict{String, Any} envelopes per session — the Any-typed value cascade
no longer pays 3x for every recursive container.

The 30+ branch _unpack_any dispatcher was forcing inference to compute a precise
Union of 16 leaf types on every parametric call, costing ~50ms one-time
and inflating downstream callers with union-splitting code at every
recursive site.

@nospecializeinfer tells inference to model the return as Any. The
path is only entered for T===Any (typed unpacks reach leaves via
unpack_type), so callers never relied on the precise union.
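
For readers unfamiliar with the macro, here is a small sketch of how `Base.@nospecializeinfer` is applied (Julia 1.10+). The dispatcher `decode_any_sketch` and its handful of branches are hypothetical and far smaller than `_unpack_any`; the tag bytes follow the MessagePack spec.

```julia
# Hypothetical miniature dispatcher; requires Julia 1.10 or newer.
# Inference uses the declared type of the @nospecialize-d argument instead of
# re-inferring the body for every concrete `hint` a caller passes.
Base.@nospecializeinfer function decode_any_sketch(io::IO, @nospecialize(hint::Type))
    tag = read(io, UInt8)
    if tag == 0xc0
        return nothing                      # nil
    elseif tag == 0xc2
        return false
    elseif tag == 0xc3
        return true
    elseif tag <= 0x7f
        return Int64(tag)                   # positive fixint
    elseif tag == 0xcd
        return ntoh(read(io, UInt16))       # uint 16, big-endian
    else
        error("tag not covered by this sketch")
    end
end

value = decode_any_sketch(IOBuffer(UInt8[0xcd, 0x01, 0x00]), Any)
```

Callers of such a function see a loosely typed result, which is acceptable here because the path is only taken when the requested type is Any anyway.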

Measured on a mixed (typed + Any) workload:
- MsgPack-attributable inference: 284.8ms -> 166.4ms (-42%)
- _unpack_any itself: 66.4ms -> 13.5ms (-80%)
- Runtime on Any-typed payloads: ~15% faster (less call-site union code)
- Runtime on typed/struct payloads: unchanged
- 1048/1048 tests pass

Two changes to unpack_type for StructType:

1. Function-barrier split: extract strict and non-strict bodies into
   `unpack_struct_strict` and `unpack_struct`. The outer dispatcher is
   now trivial. Doesn't reduce inference much on its own (the strict
   body wasn't being instantiated for typical empty-strict callers
   anyway), but enables change 2.

2. `unpack_struct` is now `@generated`, emitting exactly fieldcount(T)
   field-match branches at generation time. The original `Base.@nif(33, ...)`
   forced inference to walk 33 branches regardless of T's actual field
   count. For a 3-field struct, that's 30 wasted branches per call.
   The generated code emits 3 branches with `unpack(io, fieldtype(T, i))`
   at each `i` literal, so each leaf resolves to a statically typed call.
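
To make change 2 concrete, here is a minimal sketch of a `@generated` decoder that emits one key-match branch per field. `PointSketch` and `from_pairs_sketch` are hypothetical, the input is a list of Symbol => value pairs rather than a MsgPack stream, and missing fields are not handled the way `construct` handles FieldNotFound.

```julia
struct PointSketch
    x::Int
    y::Float64
    label::String
end

# Hypothetical sketch: the generator runs once per T and builds an if/else
# chain with exactly fieldcount(T) branches.
@generated function from_pairs_sketch(::Type{T}, kvs) where {T}
    n = fieldcount(T)
    vars = [Symbol(:field_, i) for i in 1:n]
    init = [:($(vars[i]) = nothing) for i in 1:n]
    chain = :(nothing)                       # unknown keys fall through
    for i in n:-1:1
        name = QuoteNode(fieldname(T, i))
        FT = fieldtype(T, i)                 # spliced in as a literal type
        chain = Expr(:if, :(key === $name),
                     :($(vars[i]) = convert($FT, value)), chain)
    end
    return quote
        $(init...)
        for (key, value) in kvs
            $chain
        end
        T($(vars...))                        # a missing field would fail here
    end
end

p = from_pairs_sketch(PointSketch,
                      Pair{Symbol,Any}[:label => "a", :bogus => 1,
                                       :x => 1, :y => 2.0, :x => 7])
```

In this sketch unknown keys are skipped, key order does not matter, and a duplicate key resolves to its last occurrence, mirroring the behaviours the new tests cover.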

Also drops the closure capture `(args...) -> construct(T, args...)` from
the strict body in favor of `Base.@ncall i construct T x` (which already
splices T directly). The closure was creating a per-T method instance for
no benefit.

Measured on the same mixed (typed + Any) workload as previous commits:
- MsgPack-attributable inference: 171ms -> 42ms (-75%)
- StructType unpack hotspot: 24.9ms -> dropped out of top 8
- Runtime: within noise on all paths
- 1048/1048 tests pass

Coverage for the perf changes on this branch:
- spec-defined golden bytes for the multi-byte integer/float/str16
  formats (round trips alone can't catch a writer/reader agreeing on
  the wrong byte order)
- pack(x; sizehint) produces identical bytes across initial capacities
- the IOBuffer fast path (append and non-append) and the generic IO
  fallback of write_be produce identical bytes
- non-strict struct decoding (now generated per struct type): unknown
  keys skipped wherever they appear, key order independence, missing
  fields surfacing as FieldNotFound to construct, duplicate keys
  resolving to the last occurrence
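
The point about golden bytes can be shown with a toy encoder pair. The functions below are hypothetical and deliberately buggy in one case: the little-endian writer and reader agree with each other, so a round trip passes, and only a comparison against the spec's bytes for uint 16 (0xcd followed by a big-endian value) exposes the mistake.

```julia
# Hypothetical toy codecs for the MessagePack uint 16 format.
encode_le_sketch(x::UInt16) = UInt8[0xcd, x % UInt8, (x >> 8) % UInt8]   # wrong order
decode_le_sketch(b::Vector{UInt8}) = UInt16(b[2]) | (UInt16(b[3]) << 8)  # matching bug
encode_be_sketch(x::UInt16) = UInt8[0xcd, (x >> 8) % UInt8, x % UInt8]   # per the spec

# Spec-defined encoding of 256 as uint 16.
const GOLDEN_UINT16_256 = UInt8[0xcd, 0x01, 0x00]

roundtrip_ok       = decode_le_sketch(encode_le_sketch(0x0100)) == 0x0100
buggy_matches_spec = encode_le_sketch(0x0100) == GOLDEN_UINT16_256
fixed_matches_spec = encode_be_sketch(0x0100) == GOLDEN_UINT16_256
```

Here `roundtrip_ok` holds even though the writer is wrong, while the golden-byte comparison separates the two encoders.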

Bump version to 1.3.0: pack gained the sizehint keyword (additive);
wire format and unpack semantics are unchanged, so nothing breaking.

- CI.yml: test lts/1/pre on linux/macos/windows (windows was untested
  since the appveyor days; 1.0/1.6/nightly tested neither current
  stable nor windows), julia-actions/cache instead of the hand-rolled
  actions/cache, concurrency group, coverage upload to codecov, and a
  docs job via julia-docdeploy
- TagBot.yml: issue_comment trigger instead of hourly cron, explicit
  permissions, DOCUMENTER_KEY ssh input
- add CompatHelper.yml
- julia compat 1 -> 1.10: the branch uses Base.@nospecializeinfer
  (1.10+), so older julias were already broken; lts is the new floor
- fix the docs project: it referenced MsgPack under a pre-rewrite UUID
  and pinned Documenter ~0.22 (2019), so the docs job cannot have run
  for years; now Documenter 1, build and doctests verified locally
- README: add CI + codecov badges

SimonDanisch reopened this Jun 11, 2026
SimonDanisch merged commit 7326736 into master Jun 11, 2026
22 of 24 checks passed
SimonDanisch deleted the perf/write-be branch June 11, 2026 19:59