Performance improvements #69
Merged
`write(io::IO, x::Real)` for non-byte primitives goes through `write(io, Ref(x), sizeof(x))`, and the Ref allocation isn't elided by escape analysis (≤Julia 1.12), so every msgpack integer/float wider than a byte costs one heap allocation. `write_be` inlines that path with a local Ref + `GC.@preserve` + `unsafe_copyto!` into the IOBuffer's data, which the compiler can SROA: zero allocations per primitive write.

Also expose `sizehint` as a kwarg on `pack(x)` so callers that know the output will be large can skip the geometric resize sequence on first use.

Idiomatic cleanup: `pack_format` and `_pack_integer` now `return nothing` explicitly instead of leaking the `Int` byte-count from `write` up through `pack_type` chains. No perf effect (the value was already being discarded), just a clearer contract for side-effect-only methods.

PERF.md records the bench results: 1000 small Ints into a reused IOBuffer went 745 allocs → 0; protocol-level RTT median in bench/bench.jl improved ~5%, p99 latency ~9% on top of the existing Bonito-side optimisations. Full numbers and methodology in the doc.
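A minimal sketch of the local Ref + `GC.@preserve` + `unsafe_copyto!` pattern described above. This is not the PR's `write_be`: the name `write_be_sketch!` is hypothetical, it targets a plain `Vector{UInt8}` rather than the IOBuffer's internal data, and it is restricted to integers to keep the example short.

```julia
# Hypothetical illustration only; the real write_be operates on an IOBuffer.
const WideInt = Union{Int16,Int32,Int64,UInt16,UInt32,UInt64}

function write_be_sketch!(buf::Vector{UInt8}, pos::Int, x::T) where {T<:WideInt}
    n = sizeof(T)
    # Grow the destination if the value would not fit at `pos`.
    length(buf) >= pos + n - 1 || resize!(buf, pos + n - 1)
    # hton converts host byte order to big-endian (network) order.
    r = Ref(hton(x))
    # Keep `r` and `buf` rooted while raw pointers into them are in use.
    GC.@preserve r buf begin
        src = Ptr{UInt8}(Base.unsafe_convert(Ptr{T}, r))
        unsafe_copyto!(pointer(buf, pos), src, n)
    end
    return pos + n  # next free position
end

buf = UInt8[]
next = write_be_sketch!(buf, 1, UInt16(0x0102))
next = write_be_sketch!(buf, next, UInt32(0xdeadbeef))
```

Because the Ref never leaves the function in this shape, the compiler has the opportunity to replace it with scalars; whether it actually does on a given Julia version is something to confirm with `@allocated`.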
`_unpack_array(io, n, T, strict)` and `_unpack_map(io, n, T, strict)` were
getting compiled into 3 separate specializations per `T` — one each for
n::UInt8, n::UInt16, n::UInt32 — because the size prefix coming out of
unpack_format was its native width type. The bodies are structurally
identical (n is just a counter); the fan-out was pure waste.
Normalize n to Int at the call sites (Array16Format / Array32Format /
ArrayFixFormat and Map16/32/Fix). Result:
- 3 width specs per T → 1 spec per T (per array/map)
- Cold MsgPack.unpack(bytes) inference: 261ms → 165ms (-37%)
- ROOT total inference for one cold unpack: 7.34s → 5.99s (-1.35s, -18%)
(the compiler avoids walking the redundant specialization branches)
No runtime cost — for-loop bounds want Int anyway, the conversion is free.
Roundtrip and 1048 MsgPack tests pass.
Most relevant for hot-path consumers like Bonito that unpack many
Dict{String, Any} envelopes per session — the Any-typed value cascade
no longer pays 3x for every recursive container.
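The normalization can be illustrated with a small stand-alone sketch. The helper names `read_len` and `unpack_bytes_sketch` are hypothetical and the element type is fixed to `UInt8` for brevity; the point is only that the length prefix is widened to `Int` before the loop body is called, so the body compiles once regardless of prefix width.

```julia
# Hypothetical helper: read a big-endian length prefix of width T, widen to Int.
read_len(io::IO, ::Type{T}) where {T<:Unsigned} = Int(ntoh(read(io, T)))

# The body only ever sees n::Int, so there is one specialization
# instead of one per UInt8/UInt16/UInt32 prefix type.
function unpack_bytes_sketch(io::IO, n::Int)
    out = Vector{UInt8}(undef, n)
    for i in 1:n
        out[i] = read(io, UInt8)
    end
    return out
end

# A 16-bit big-endian length (3) followed by three payload bytes.
io = IOBuffer(UInt8[0x00, 0x03, 0x0a, 0x0b, 0x0c])
n = read_len(io, UInt16)
items = unpack_bytes_sketch(io, n)
```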
The 30+ branch dispatcher was forcing inference to compute a precise Union of 16 leaf types on every parametric call, costing ~50ms one-time and inflating downstream callers with union-splitting code at every recursive site. @nospecializeinfer tells inference to model the return as Any. The path is only entered for T===Any (typed unpacks reach leaves via unpack_type), so callers never relied on the precise union.

Measured on a mixed (typed + Any) workload:
- MsgPack-attributable inference: 284.8ms -> 166.4ms (-42%)
- _unpack_any itself: 66.4ms -> 13.5ms (-80%)
- Runtime on Any-typed payloads: ~15% faster (less call-site union code)
- Runtime on typed/struct payloads: unchanged
- 1048/1048 tests pass
Two changes to unpack_type for StructType:

1. Function-barrier split: extract strict and non-strict bodies into `unpack_struct_strict` and `unpack_struct`. The outer dispatcher is now trivial. Doesn't reduce inference much on its own (the strict body wasn't being instantiated for typical empty-strict callers anyway), but enables change 2.
2. `unpack_struct` is now `@generated`, emitting exactly fieldcount(T) field-match branches at expand time. The original `Base.@nif(33, ...)` forced inference to walk 33 branches regardless of T's actual field count. For a 3-field struct, that's 30 wasted branches per call. The generated code emits 3 branches with `unpack(io, fieldtype(T, i))` at each `i` literal, so each leaf resolves to a statically typed call.

Also drops the closure capture `(args...) -> construct(T, args...)` from the strict body in favor of `Base.@nCall i construct T x` (which already splices T directly). The closure was creating a per-T method instance for no benefit.

Measured on the same mixed (typed + Any) workload as previous commits:
- MsgPack-attributable inference: 171ms -> 42ms (-75%)
- StructType unpack hotspot: 24.9ms -> dropped out of top 8
- Runtime: within noise on all paths
- 1048/1048 tests pass
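As a rough illustration of the per-type branch generation, the sketch below builds a key-to-field-index matcher with exactly `fieldcount(T)` branches. It is not the PR's `unpack_struct`: the name `field_index_sketch` and the `Point` type are hypothetical, and the real decoder unpacks a value in each branch rather than returning an index.

```julia
# Hypothetical sketch: emit one `key === :name` test per field of T.
@generated function field_index_sketch(::Type{T}, key::Symbol) where {T}
    names = fieldnames(T)
    ex = :(0)  # fallthrough: unknown key
    # Build the chain from the last field backwards so field 1 is tested first.
    for i in length(names):-1:1
        ex = :(key === $(QuoteNode(names[i])) ? $i : $ex)
    end
    return ex
end

struct Point
    x::Int
    y::Int
    label::String
end

idx_y = field_index_sketch(Point, :y)
idx_missing = field_index_sketch(Point, :unknown)
```

For `Point` the generated body is a three-way conditional chain, so inference only sees as many branches as the struct has fields.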
Coverage for the perf changes on this branch:
- spec-defined golden bytes for the multi-byte integer/float/str16 formats (round trips alone can't catch a writer/reader agreeing on the wrong byte order)
- pack(x; sizehint) produces identical bytes across initial capacities
- the IOBuffer fast path (append and non-append) and the generic IO fallback of write_be produce identical bytes
- non-strict struct decoding (now generated per struct type): unknown keys skipped wherever they appear, key order independence, missing fields surfacing as FieldNotFound to construct, duplicate keys resolving to the last occurrence

Bump version to 1.3.0: pack gained the sizehint keyword (additive); wire format and unpack semantics are unchanged, so nothing breaking.
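The sizehint invariance check listed above can be sketched without the package itself. The function `emit_sketch` is a hypothetical stand-in for `pack`; it relies only on Base's `IOBuffer(; sizehint=...)` keyword and shows the shape of the assertion: the initial capacity changes how the buffer grows, not what ends up in it.

```julia
# Hypothetical stand-in for pack(x; sizehint): write bytes into a pre-sized buffer.
function emit_sketch(payload::Vector{UInt8}; sizehint::Int=0)
    io = IOBuffer(; sizehint=sizehint)
    write(io, payload)
    return take!(io)
end

payload = collect(UInt8, 1:200)
small = emit_sketch(payload; sizehint=0)
large = emit_sketch(payload; sizehint=4096)
```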
- CI.yml: test lts/1/pre on linux/macos/windows (windows was untested since the appveyor days; 1.0/1.6/nightly tested neither current stable nor windows), julia-actions/cache instead of the hand-rolled actions/cache, concurrency group, coverage upload to codecov, and a docs job via julia-docdeploy
- TagBot.yml: issue_comment trigger instead of hourly cron, explicit permissions, DOCUMENTER_KEY ssh input
- add CompatHelper.yml
- julia compat 1 -> 1.10: the branch uses Base.@nospecializeinfer (1.10+), so older julias were already broken; lts is the new floor
- fix the docs project: it referenced MsgPack under a pre-rewrite UUID and pinned Documenter ~0.22 (2019), so the docs job cannot have run for years; now Documenter 1, build and doctests verified locally
- README: add CI + codecov badges
Faster packing/unpacking, less inference
Packing showed up in Bonito profiles mainly as allocations: every multi-byte
write (ints, floats, length prefixes) allocates a Ref inside stdlib's
`write(io, ::Real)`, which escape analysis can't elide before 1.12. So:
- `write_be` helper that writes primitives to IOBuffers without the Ref.
  Packing 1000 Ints into a reused buffer goes from 745 allocs to 0, realistic
  messages lose 16-55% of their allocations (numbers in PERF.md)
- `pack(x; sizehint=64)` to pre-size the output buffer
- `_unpack_any` now uses `@nospecializeinfer` (~50ms less inference, ~15%
  faster for Any payloads since callers stop union splitting over the return
  type), the struct decoder generates its field matching per type instead of
  walking a fixed 33-way `@nif`, and array/map lengths share one Int
  specialization instead of one per UInt8/16/32 prefix
No changes to the format or unpack behavior, and there are new tests for
exactly that (bytes from the spec, same output on all IO paths and
sizehints, struct decoding edge cases like unknown/duplicate/missing keys).
While at it I updated the CI to the PkgTemplates standard - windows hasn't
been tested since the appveyor days, and the docs build has been broken for
years (pre-rewrite UUID in docs/Project.toml plus Documenter 0.22). Julia
compat is 1.10 now because of `@nospecializeinfer`, version goes to 1.3.0.