Performance improvements by SimonDanisch · Pull Request #69 · JuliaIO/MsgPack.jl

SimonDanisch · 2026-06-11T15:13:55Z

Faster packing/unpacking, less inference

Packing showed up in Bonito profiles mainly as allocations: every multi-byte
write (ints, floats, length prefixes) allocates a Ref inside stdlib's
write(io, ::Real), which escape analysis can't elide before 1.12. So:

- new write_be helper that writes primitives to IOBuffers without the Ref.
  Packing 1000 Ints into a reused buffer goes from 745 allocs to 0, and
  realistic messages lose 16-55% of their allocations (numbers in PERF.md)
- pack(x; sizehint=64) to pre-size the output buffer
- unpack was compile-heavy: _unpack_any now uses @nospecializeinfer
  (~50ms less inference, ~15% faster for Any payloads since callers stop
  union splitting over the return type), the struct decoder generates its
  field matching per type instead of walking a fixed 33-way @nif, and
  array/map lengths share one Int specialization instead of one per
  UInt8/16/32 prefix

No changes to the format or unpack behavior, and there are new tests for
exactly that (bytes from the spec, same output on all IO paths and
sizehints, struct decoding edge cases like unknown/duplicate/missing keys).

While at it I updated the CI to the PkgTemplates standard - windows hasn't
been tested since the appveyor days, and the docs build has been broken for
years (pre-rewrite UUID in docs/Project.toml plus Documenter 0.22). Julia
compat is 1.10 now because of @nospecializeinfer, version goes to 1.3.0.

`write(io::IO, x::Real)` for non-byte primitives goes through `write(io, Ref(x), sizeof(x))`, and the Ref allocation isn't elided by escape analysis (≤Julia 1.12), so every msgpack integer/float wider than a byte costs one heap allocation. `write_be` inlines that path with a local Ref + GC.@preserve + unsafe_copyto! into the IOBuffer's data, which the compiler can SROA: zero allocations per primitive write.

Also expose `sizehint` as a kwarg on `pack(x)` so callers that know the output will be large can skip the geometric resize sequence on first use.

Idiomatic cleanup: `pack_format` and `_pack_integer` now `return nothing` explicitly instead of leaking the `Int` byte-count from `write` up through `pack_type` chains. No perf effect (the value was already being discarded), just a clearer contract for side-effect-only methods.

PERF.md records the bench results: 1000 small Ints into a reused IOBuffer went 745 allocs → 0; protocol-level RTT median in bench/bench.jl improved ~5%, p99 latency ~9% on top of the existing Bonito-side optimisations. Full numbers and methodology in the doc.
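To make the required byte layout concrete, here is a minimal sketch of a big-endian write helper combined with a pre-sized buffer. It is not the PR's `write_be` (which copies through a local Ref with `unsafe_copyto!`); the shift-based loop is a simpler stand-in, and `write_be_sketch` and `pack_u32s_sketch` are hypothetical names. The only `sizehint` used here is the keyword of Base's `IOBuffer`, and the sketch makes no claim about allocation counts.

```julia
# Hypothetical helper: emit an unsigned integer most-significant byte first,
# using only single-byte writes (a stand-in for the PR's pointer-based copy).
function write_be_sketch(io::IO, x::T) where {T<:Union{UInt16,UInt32,UInt64}}
    for shift in (8 * (sizeof(T) - 1)):-8:0
        write(io, (x >> shift) % UInt8)
    end
    return nothing  # side-effect-only method: do not leak a byte count
end

# Hypothetical packer: pre-size the output buffer, then append each value.
function pack_u32s_sketch(xs::Vector{UInt32}; sizehint::Integer=64)
    io = IOBuffer(; sizehint=sizehint)
    for x in xs
        write_be_sketch(io, x)
    end
    return take!(io)
end
```

The initial capacity should only influence how often the buffer grows, never the bytes produced, which is the property the new `sizehint` tests are described as checking.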

`_unpack_array(io, n, T, strict)` and `_unpack_map(io, n, T, strict)` were getting compiled into 3 separate specializations per `T` (one each for n::UInt8, n::UInt16, n::UInt32) because the size prefix coming out of unpack_format was its native width type. The bodies are structurally identical (n is just a counter); the fan-out was pure waste.

Normalize n to Int at the call sites (Array16Format / Array32Format / ArrayFixFormat and Map16/32/Fix).

Result:
- 3 width specs per T → 1 spec per T (per array/map)
- Cold MsgPack.unpack(bytes) inference: 261ms → 165ms (-37%)
- ROOT total inference for one cold unpack: 7.34s → 5.99s (-1.35s, -18%) (the compiler avoids walking the redundant specialization branches)

No runtime cost: for-loop bounds want Int anyway, so the conversion is free. Roundtrip and 1048 MsgPack tests pass.

Most relevant for hot-path consumers like Bonito that unpack many Dict{String, Any} envelopes per session: the Any-typed value cascade no longer pays 3x for every recursive container.
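The normalization pattern can be sketched as follows. This is an illustration under assumptions, not the package's code: `read_length_sketch` and `read_items_sketch` are hypothetical names, and the element type is fixed to `UInt8` to keep the example short. The point is that the width-specific code ends at the call site, so the loop body only ever sees `n::Int`.

```julia
# Hypothetical: read a big-endian length prefix at its native width and
# immediately widen it to Int, so nothing downstream depends on the width.
function read_length_sketch(io::IO, ::Type{W}) where {W<:Union{UInt8,UInt16,UInt32}}
    return Int(ntoh(read(io, W)))
end

# Hypothetical container reader: compiled once for n::Int instead of once
# per prefix width.
function read_items_sketch(io::IO, n::Int)
    items = Vector{UInt8}(undef, n)
    for i in 1:n
        items[i] = read(io, UInt8)
    end
    return items
end
```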

The 30+ branch dispatcher was forcing inference to compute a precise Union of 16 leaf types on every parametric call, costing ~50ms one-time and inflating downstream callers with union-splitting code at every recursive site.

@nospecializeinfer tells inference to model the return as Any. The path is only entered for T===Any (typed unpacks reach leaves via unpack_type), so callers never relied on the precise union.

Measured on a mixed (typed + Any) workload:
- MsgPack-attributable inference: 284.8ms -> 166.4ms (-42%)
- _unpack_any itself: 66.4ms -> 13.5ms (-80%)
- Runtime on Any-typed payloads: ~15% faster (less call-site union code)
- Runtime on typed/struct payloads: unchanged
- 1048/1048 tests pass
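For readers unfamiliar with the macro, a minimal sketch of its syntax on a tag dispatcher is shown below. It assumes Julia 1.10 or newer, `decode_any_sketch` is a hypothetical name, and the four tags handled (nil `0xc0`, false `0xc2`, true `0xc3`, positive fixint `0x00`-`0x7f`) are only a small subset of the format. It does not reproduce `_unpack_any` or its measured inference savings.

```julia
# Hypothetical dispatcher: the return type varies with the tag byte.
# Base.@nospecializeinfer asks inference to use the declared type of the
# @nospecialize'd argument instead of each concrete IO type it is called with.
Base.@nospecializeinfer function decode_any_sketch(@nospecialize(io::IO))
    tag = read(io, UInt8)
    if tag == 0xc0
        return nothing        # nil
    elseif tag == 0xc2
        return false
    elseif tag == 0xc3
        return true
    elseif tag <= 0x7f
        return Int(tag)       # positive fixint carries its value in the tag
    else
        error("tag not covered by this sketch")
    end
end
```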

@nif

Two changes to unpack_type for StructType:

1. Function-barrier split: extract strict and non-strict bodies into `unpack_struct_strict` and `unpack_struct`. The outer dispatcher is now trivial. Doesn't reduce inference much on its own (the strict body wasn't being instantiated for typical empty-strict callers anyway), but enables change 2.
2. `unpack_struct` is now `@generated`, emitting exactly fieldcount(T) field-match branches at expand time. The original `Base.@nif(33, ...)` forced inference to walk 33 branches regardless of T's actual field count. For a 3-field struct, that's 30 wasted branches per call. The generated code emits 3 branches with `unpack(io, fieldtype(T, i))` at each `i` literal, so each leaf resolves to a statically typed call.

Also drops the closure capture `(args...) -> construct(T, args...)` from the strict body in favor of `Base.@nCall i construct T x` (which already splices T directly). The closure was creating a per-T method instance for no benefit.

Measured on the same mixed (typed + Any) workload as previous commits:
- MsgPack-attributable inference: 171ms -> 42ms (-75%)
- StructType unpack hotspot: 24.9ms -> dropped out of top 8
- Runtime: within noise on all paths
- 1048/1048 tests pass
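The per-type branch generation can be illustrated with a small, self-contained sketch. It is not the PR's `unpack_struct`: `build_from_pairs_sketch` and `PointSketch` are hypothetical, the input is an in-memory list of `key => value` pairs rather than an IO stream, and missing fields are not handled (the PR describes surfacing them as FieldNotFound). What it does show is a `@generated` function emitting exactly one key check per field of `T`.

```julia
struct PointSketch
    x::Int
    y::Float64
end

# Hypothetical: emit fieldcount(T) key checks, one per real field of T.
@generated function build_from_pairs_sketch(::Type{T}, pairs) where {T}
    names = fieldnames(T)
    vars = [Symbol("value_", i) for i in 1:length(names)]
    inits = [:($(v) = nothing) for v in vars]
    # Each check has the field name and field type spliced in as literals.
    checks = [:(key === $(QuoteNode(names[i])) &&
                ($(vars[i]) = convert($(fieldtype(T, i)), value)))
              for i in 1:length(names)]
    return quote
        $(inits...)
        for (key, value) in pairs
            $(checks...)   # unknown keys match no check and are skipped
        end
        return T($(vars...))
    end
end
```

With this sketch, `build_from_pairs_sketch(PointSketch, [:y => 2.5, :extra => "skipped", :x => 1, :x => 7])` skips `:extra`, accepts the keys in any order, and lets the last `:x` win, mirroring the non-strict decoding behavior the PR's tests are described as covering.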

Coverage for the perf changes on this branch:
- spec-defined golden bytes for the multi-byte integer/float/str16 formats (round trips alone can't catch a writer/reader agreeing on the wrong byte order)
- pack(x; sizehint) produces identical bytes across initial capacities
- the IOBuffer fast path (append and non-append) and the generic IO fallback of write_be produce identical bytes
- non-strict struct decoding (now generated per struct type): unknown keys skipped wherever they appear, key order independence, missing fields surfacing as FieldNotFound to construct, duplicate keys resolving to the last occurrence

Bump version to 1.3.0: pack gained the sizehint keyword (additive); wire format and unpack semantics are unchanged, so nothing breaking.
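The argument for golden bytes can be shown with a tiny example. The encoders and decoders below are hypothetical and cover only the uint 16 format (tag `0xcd` followed by two big-endian bytes, per the MessagePack spec); the little-endian pair is deliberately wrong.

```julia
# Correct pair: big-endian payload after the uint 16 tag.
encode_be(x::UInt16) = UInt8[0xcd, (x >> 8) % UInt8, x % UInt8]
decode_be(b) = (UInt16(b[2]) << 8) | UInt16(b[3])

# Deliberately wrong pair: writer and reader agree on little-endian.
encode_le(x::UInt16) = UInt8[0xcd, x % UInt8, (x >> 8) % UInt8]
decode_le(b) = (UInt16(b[3]) << 8) | UInt16(b[2])

x = 0x1234
golden = UInt8[0xcd, 0x12, 0x34]       # bytes fixed by the spec

roundtrip_be = decode_be(encode_be(x)) == x
roundtrip_le = decode_le(encode_le(x)) == x   # passes despite the bug
matches_be = encode_be(x) == golden
matches_le = encode_le(x) == golden           # only this check exposes it
```

Both round trips succeed, so a round-trip test alone cannot tell the two pairs apart; only the comparison against the spec bytes fails for the little-endian pair.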

- CI.yml: test lts/1/pre on linux/macos/windows (windows was untested since the appveyor days; 1.0/1.6/nightly tested neither current stable nor windows), julia-actions/cache instead of the hand-rolled actions/cache, concurrency group, coverage upload to codecov, and a docs job via julia-docdeploy
- TagBot.yml: issue_comment trigger instead of hourly cron, explicit permissions, DOCUMENTER_KEY ssh input
- add CompatHelper.yml
- julia compat 1 -> 1.10: the branch uses Base.@nospecializeinfer (1.10+), so older julias were already broken; lts is the new floor
- fix the docs project: it referenced MsgPack under a pre-rewrite UUID and pinned Documenter ~0.22 (2019), so the docs job cannot have run for years; now Documenter 1, build and doctests verified locally
- README: add CI + codecov badges

SimonDanisch added 7 commits May 12, 2026 19:27

Merge branch 'master' into perf/write-be

f754a52

SimonDanisch closed this Jun 11, 2026

SimonDanisch reopened this Jun 11, 2026

SimonDanisch merged commit 7326736 into master Jun 11, 2026
22 of 24 checks passed

SimonDanisch deleted the perf/write-be branch June 11, 2026 19:59
