CPU inference optimizations: fuse_custom() across all blocks + channels_last benchmark#201
Open
ayaanmustafa wants to merge 3 commits into
PyTorch forbids setting requires_grad=False on non-leaf tensors. When fuse=True creates non-leaf parameters (e.g. from fused model checkpoints), the old code crashed with: RuntimeError: you can only change requires_grad flags of leaf variables. Fix: only set requires_grad=False on leaf tensors; for non-leaf tensors, detach via .data.detach() instead.
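The leaf/non-leaf distinction described above can be illustrated with a minimal sketch. The helper name `freeze` is hypothetical and is not the code in `autobackend.py`; it only shows the rule the commit message states: flip the flag on leaf tensors, detach everything else.

```python
import torch


def freeze(t: torch.Tensor) -> torch.Tensor:
    """Return a tensor that does not require grad (hypothetical helper)."""
    if t.is_leaf:
        t.requires_grad = False  # allowed: leaf tensors own their flag
        return t
    return t.data.detach()  # non-leaf: take a detached tensor instead


w = torch.ones(3, requires_grad=True)  # leaf tensor
fused = w * 2.0                        # non-leaf: produced by an operation

# Setting the flag directly on the non-leaf tensor reproduces the error.
raised = False
try:
    fused.requires_grad = False
except RuntimeError:
    raised = True

frozen_leaf = freeze(w)
frozen_nonleaf = freeze(fused)
```

In this sketch the direct assignment raises, while `freeze` succeeds on both tensors.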
- Add fuse_custom() to 20+ block modules (SPP, SPPF, C1-C3, C2f, Bottleneck, BottleneckCSP, ResNetBlock, RepC3, RepCSP, RepNCSPELAN4, ADown, SPPELAN, HGStem, HGBlock, Proto, GhostBottleneck, RepBottleneck, C2fAttn, Attention, PSA, SCDown, CIB, C2fCIB, G2L_CRM, DilatedBlock, DilatedBottleneck)
- Optimize CBFuse: use sum() instead of torch.sum(torch.stack())
- Optimize DilatedBlock: handle fused checkpoints (nn.Identity bn)
- Fix RepVGGDW.forward to support fused state (conv1 deleted)
- Fix MaxSigmoidAttnBlock.ec type (was Conv, should be int)
- Add threads config field to default.yaml for CPU inference tuning
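The CBFuse item in this commit message swaps a stack-then-reduce for a plain sum. The sketch below uses made-up tensor shapes and is not the module's code; it only checks that the two forms agree numerically.

```python
import torch

torch.manual_seed(0)
xs = [torch.randn(1, 8, 4, 4) for _ in range(3)]

# Old form: materializes a (3, 1, 8, 4, 4) tensor before reducing it.
stacked = torch.sum(torch.stack(xs), dim=0)

# New form: Python's built-in sum adds the tensors one after another.
summed = sum(xs)

same = torch.allclose(stacked, summed, atol=1e-6)
```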
benchmark_cpu_inference.py tests --channels-last and --fuse as opt-in toggles, measuring raw forward and full predict pipeline. No hardcoded paths; uses argparse for model/image paths.
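A minimal sketch of what opt-in toggles look like with argparse. Only `--channels-last` and `--fuse` come from the commit message; the `--model` and `--image` flag names and the sample file names are assumptions made for illustration, not the script's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="CPU inference benchmark (illustrative sketch)")
    # Hypothetical flag names: the commit only says paths come from argparse.
    p.add_argument("--model", required=True, help="path to model weights")
    p.add_argument("--image", required=True, help="path to a test image")
    # store_true flags default to False, so both optimizations are opt-in.
    p.add_argument("--channels-last", action="store_true", help="use NHWC memory format")
    p.add_argument("--fuse", action="store_true", help="fuse layers before timing")
    return p


args = build_parser().parse_args(["--model", "model.pt", "--image", "page.jpg", "--fuse"])
```

With this argument list, `args.fuse` is `True` and `args.channels_last` stays `False`.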
Three commits that add a recursive fuse_custom() path to DocLayout-YOLO's inference stack and ship a CPU benchmark to validate the gains. No model architecture, training loop, or detection behavior is changed; the speedup comes from fusing Conv+BN (and friends) and letting the benchmark toggle NHWC memory format.
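For readers unfamiliar with Conv+BN fusion, here is a minimal sketch of the standard folding arithmetic in eval mode. The function name `fuse_conv_bn` is hypothetical and this is not the PR's `fuse_custom()` implementation; it only shows why fusion leaves outputs unchanged.

```python
import torch
import torch.nn as nn


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold eval-mode BatchNorm statistics into a new Conv2d (hypothetical helper)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        # BN(y) = (y - mean) * gamma / sqrt(var + eps) + beta, applied per output channel.
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused


torch.manual_seed(0)
conv = nn.Conv2d(3, 8, 3, padding=1, bias=False).eval()
bn = nn.BatchNorm2d(8).eval()
with torch.no_grad():
    # Give BN non-trivial statistics so the check is meaningful.
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)

x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    ref = bn(conv(x))
    out = fuse_conv_bn(conv, bn)(x)
matches = torch.allclose(ref, out, atol=1e-5)
```

The fused layer does one convolution instead of a convolution followed by a normalization pass.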
What's in this PR
1. fix(autobackend): handle non-leaf parameters in fused checkpoints
- doclayout_yolo/nn/autobackend.py: when loading a fused checkpoint, the loader tried to set requires_grad=False on parameters that weren't leaf tensors and crashed with:
  RuntimeError: you can only change requires_grad flags of leaf variables.
  Now we only set requires_grad on leaf tensors; non-leaf tensors get .data.detach() instead. This lets YOLOv10("model.pt") actually load a fused checkpoint end-to-end.
2. feat(fuse): add fuse_custom() to all block modules + CPU thread config
- doclayout_yolo/nn/modules/block.py: adds fuse_custom() (safe, idempotent, guarded by a fused flag) to: SPP, SPPF, C1, C2, C3, C2f, Bottleneck, BottleneckCSP, ResNetBlock, RepC3, RepCSP, RepNCSPELAN4, ADown, SPPELAN, HGStem, HGBlock, Proto, GhostBottleneck, RepBottleneck, C2fAttn, Attention, PSA, SCDown, CIB, C2fCIB.
- doclayout_yolo/nn/modules/g2l_crm.py: adds fuse_custom() to DilatedBlock, DilatedBottleneck, and the top-level G2L_CRM module. DilatedBlock.dilated_conv now handles fused checkpoints where bn has been replaced by nn.Identity (falls back to using conv.bias).
- doclayout_yolo/nn/modules/conv.py: supporting changes to Conv.fuse_custom.
- doclayout_yolo/nn/modules/__init__.py: exports CIB (it was missing from __all__, which broke from doclayout_yolo.nn.modules import CIB for custom code).
- doclayout_yolo/cfg/default.yaml: new threads: field for tuning CPU inference thread count.
- CBFuse uses sum() instead of torch.sum(torch.stack(...)) (no extra tensor allocation); RepVGGDW.forward supports the fused state (when conv1 is deleted by fuse()); MaxSigmoidAttnBlock.ec type fixed (was Conv, should be int).
3. chore(benchmarks): add CPU inference benchmark with opt-in toggles
- benchmarks/benchmark_cpu_inference.py: new CLI that measures the raw forward pass and the full model.predict(...) pipeline, with cache_flush() between runs. Both --channels-last and --fuse are opt-in; nothing is enabled by default, so existing workflows are unaffected. No hardcoded paths; everything is argparse.
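As a rough illustration of how a benchmark of this kind can time a callable, here is a stdlib-only sketch with warmup runs and a median. The function `time_callable` and its parameters are assumptions; the script's actual timing method and its cache_flush() helper are not reproduced here.

```python
import statistics
import time
from typing import Callable


def time_callable(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> float:
    """Return the median wall-clock time of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warmup calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


calls = []
median_ms = time_callable(lambda: calls.append(1), warmup=2, runs=5)
```

The same wrapper could be pointed at either a raw forward call or a full predict call, which is the comparison the benchmark describes.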
Why
DocLayout-YOLO inherits a model.fuse() entry point from upstream Ultralytics, but it only handles Conv and Conv2; it skips most of the block modules (Bottleneck, C2f, C2fCIB, C2fAttn, G2L_CRM, etc.). When you load a model and call model.fuse(), the inner blocks don't actually fuse, so the speedup is much smaller than it could be. This PR adds the missing fuse_custom() methods so recursive_fuse() from the benchmark can hit the whole network.
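The idea of walking the whole network and fusing each block once can be sketched as follows. The `Block` class and this `recursive_fuse` body are illustrative assumptions, not the PR's implementation; the BN swap is a placeholder for real weight folding.

```python
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical block with an idempotent fuse_custom() guarded by a fused flag."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)
        self.bn = nn.BatchNorm2d(3)
        self.fused = False

    def fuse_custom(self) -> None:
        if self.fused:
            return  # second call is a no-op
        # Placeholder: a real implementation folds BN into conv before dropping it.
        self.bn = nn.Identity()
        self.fused = True


def recursive_fuse(model: nn.Module) -> int:
    """Call fuse_custom() on every unfused submodule; return how many were fused."""
    count = 0
    # Snapshot the module list first, since fusing replaces submodules.
    for m in list(model.modules()):
        fn = getattr(m, "fuse_custom", None)
        if callable(fn) and not getattr(m, "fused", False):
            fn()
            count += 1
    return count


model = nn.Sequential(Block(), nn.Sequential(Block(), nn.ReLU()))
first = recursive_fuse(model)
second = recursive_fuse(model)
```

Here the first pass fuses both nested blocks and the second pass finds nothing left to do, which is the behavior the fused guard is meant to give.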
The autobackend fix and the __init__.py CIB export are prerequisites that this work surfaced; both were real blockers when loading fused checkpoints or trying to from ... import CIB for custom code.
Benchmark results
Run on CPU with the DocLayout-YOLO DocStructBench model (imgsz=1024, conf=0.2, default PyTorch thread count):
[Results table: rows for channels_last (NHWC) and fuse_custom recursive + channels_last; timing values not preserved]
Reproduce with:
Both toggles are opt-in: no behavior change for users who don't pass the flags. The fuse_custom row is a range because the G2L_CRM / DilatedBlock / C2fCIB paths are not all fully fused under --fuse yet (worth a follow-up PR to close the remaining gap), but the channel-format change alone gives the bulk of the win.
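For context on the channel-format toggle, this is a minimal sketch of switching a model and its input to NHWC (channels_last) in PyTorch. The tiny model is a stand-in, not DocLayout-YOLO, and the sketch checks only that outputs match, not that anything gets faster.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
x = torch.randn(1, 3, 32, 32)

with torch.no_grad():
    ref = model(x)  # default NCHW (contiguous) layout

    # Convert weights and input to channels_last; logical shapes stay NCHW.
    model_cl = model.to(memory_format=torch.channels_last)
    x_cl = x.contiguous(memory_format=torch.channels_last)
    out = model_cl(x_cl)

is_nhwc = x_cl.is_contiguous(memory_format=torch.channels_last)
matches = torch.allclose(ref, out, atol=1e-5)
```

Only the memory layout changes, so indexing and shapes in downstream code stay the same.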
Backwards compatibility
- fuse_custom() methods are additive.
- threads: field in default.yaml is optional.
- autobackend change is a strict relaxation (more load cases now succeed, no previously-working load case is broken).
Checklist
- pip install -e .
- model.predict(...) produces identical detections before/after fuse
- recursive_fuse to handle the remaining G2L_CRM / C2fCIB / RepVGG paths that still allocate BN slots (see code comments in g2l_crm.py and block.py)