CPU inference optimizations: fuse_custom() across all blocks + channels_last benchmark by ayaanmustafa · Pull Request #201 · opendatalab/DocLayout-YOLO

ayaanmustafa · 2026-06-10T22:05:12Z

Three commits that add a recursive `fuse_custom()` path to DocLayout-YOLO's
inference stack and ship a CPU benchmark to validate the gains. No model
architecture, training loop, or detection behavior is changed. The speedup
comes from fusing Conv+BN pairs (and similar foldable layers) inside the
block modules and from letting the benchmark toggle the NHWC
(`channels_last`) memory format.
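
For readers unfamiliar with the NHWC toggle, here is a minimal sketch of switching a model and its input to PyTorch's `channels_last` memory format. The toy `nn.Conv2d` is an assumption standing in for the real detector; the benchmark script's actual code is not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the detector (assumption: not the DocLayout-YOLO model).
model = nn.Conv2d(3, 4, kernel_size=3, padding=1).eval()
x = torch.randn(1, 3, 8, 8)  # logical shape stays N, C, H, W

with torch.no_grad():
    y_nchw = model(x)

    # Convert weights and input to the channels_last (NHWC) physical layout.
    # Module.to() converts in place and returns the same module object.
    model_cl = model.to(memory_format=torch.channels_last)
    x_cl = x.contiguous(memory_format=torch.channels_last)
    y_nhwc = model_cl(x_cl)

# The logical shape is unchanged; only the strides (memory order) differ.
same_shape = x_cl.shape == x.shape
is_nhwc = x_cl.is_contiguous(memory_format=torch.channels_last)
max_diff = (y_nchw - y_nhwc).abs().max().item()
```

The outputs should agree up to floating-point noise, which is why the toggle can be benchmarked without touching detection behavior.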

What's in this PR

1. `fix(autobackend): handle non-leaf parameters in fused checkpoints`

`doclayout_yolo/nn/autobackend.py`: when loading a fused checkpoint,
the loader tried to set `requires_grad=False` on parameters that
weren't leaf tensors and crashed with:
`RuntimeError: you can only change requires_grad flags of leaf variables.`
Now `requires_grad` is only set on leaf tensors; non-leaf tensors are
detached via `.data.detach()` instead. This lets `YOLOv10("model.pt")`
load a fused checkpoint end-to-end.
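
The leaf / non-leaf distinction can be reproduced in a few lines. This is a sketch of the idea only: `freeze` is a hypothetical helper, not the code in `autobackend.py`, and it uses `.detach()`, which like the `.data.detach()` call named above returns a tensor that does not require grad.

```python
import torch


def freeze(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a tensor that does not require grad."""
    if t.is_leaf:
        t.requires_grad_(False)  # allowed: leaf tensors own their flag
        return t
    # Non-leaf: flipping the flag raises RuntimeError, so detach instead.
    return t.detach()


leaf = torch.ones(3, requires_grad=True)
non_leaf = leaf * 2  # produced by an op on a grad-requiring tensor

# Reproduce the crash described above.
try:
    non_leaf.requires_grad = False
    raised = False
except RuntimeError:
    raised = True

frozen_non_leaf = freeze(non_leaf)
frozen_leaf = freeze(leaf)
```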

2. `feat(fuse): add fuse_custom() to all block modules + CPU thread config`

- `doclayout_yolo/nn/modules/block.py`: adds `fuse_custom()` (safe,
  idempotent, guarded by a `fused` flag) to
  SPP, SPPF, C1, C2, C3, C2f, Bottleneck, BottleneckCSP, ResNetBlock,
  RepC3, RepCSP, RepNCSPELAN4, ADown, SPPELAN, HGStem, HGBlock, Proto,
  GhostBottleneck, RepBottleneck, C2fAttn, Attention, PSA, SCDown, CIB,
  C2fCIB.
- `doclayout_yolo/nn/modules/g2l_crm.py`: adds `fuse_custom()` to
  DilatedBlock, DilatedBottleneck, and the top-level G2L_CRM module.
  `DilatedBlock.dilated_conv` now handles fused checkpoints where `bn`
  has been replaced by `nn.Identity` (falls back to using `conv.bias`).
- `doclayout_yolo/nn/modules/conv.py`: supporting changes to `Conv.fuse_custom`.
- `doclayout_yolo/nn/modules/__init__.py`: exports `CIB`, which was missing
  from `__all__` and broke `from doclayout_yolo.nn.modules import CIB` in
  custom code.
- `doclayout_yolo/cfg/default.yaml`: new `threads:` field for tuning
  CPU inference thread count.
- Optimizations: `CBFuse` uses `sum()` instead of
  `torch.sum(torch.stack(...))` (no extra tensor allocation);
  `RepVGGDW.forward` supports the fused state (when `conv1` is deleted
  by `fuse()`); `MaxSigmoidAttnBlock.ec` type fixed (was `Conv`, should
  be `int`).
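
As an illustration of what an idempotent, flag-guarded `fuse_custom()` does, here is a minimal sketch that folds a BatchNorm into the preceding convolution. `ConvBN` is a hypothetical block written for this example, not the repository's `Conv` class, and the fold is only valid in eval mode, where BN uses its running statistics.

```python
import torch
import torch.nn as nn


class ConvBN(nn.Module):
    """Hypothetical Conv+BN block (assumption: not the repo's Conv class)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.fused = False

    def forward(self, x):
        return self.bn(self.conv(x))

    @torch.no_grad()
    def fuse_custom(self):
        if self.fused:  # idempotent: a second call is a no-op
            return self
        bn = self.bn
        # y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused = nn.Conv2d(self.conv.in_channels, self.conv.out_channels,
                          3, padding=1, bias=True)
        fused.weight.copy_(self.conv.weight * scale.view(-1, 1, 1, 1))
        fused.bias.copy_(bn.bias - bn.running_mean * scale)
        # Replace BN with Identity so forward() keeps working unchanged.
        self.conv, self.bn, self.fused = fused, nn.Identity(), True
        return self


torch.manual_seed(0)
block = ConvBN(3, 8).eval()
x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    # Give BN non-trivial statistics so the fold is actually exercised.
    block.bn.running_mean.uniform_(-1, 1)
    block.bn.running_var.uniform_(0.5, 2.0)
    block.bn.weight.uniform_(0.5, 1.5)
    block.bn.bias.uniform_(-0.5, 0.5)
    before = block(x)
    after = block.fuse_custom().fuse_custom()(x)  # second call does nothing
max_err = (before - after).abs().max().item()
```

Swapping `bn` for `nn.Identity` rather than deleting it keeps `forward()` untouched, which mirrors the `nn.Identity` case that `DilatedBlock.dilated_conv` now has to handle.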

3. `chore(benchmarks): add CPU inference benchmark with opt-in toggles`

`benchmarks/benchmark_cpu_inference.py` adds a new CLI:
```
python benchmark_cpu_inference.py --model model.pt --image img.png
python benchmark_cpu_inference.py --model model.pt --image img.png --channels-last
python benchmark_cpu_inference.py --model model.pt --image img.png --channels-last --fuse
```
It measures both the raw forward pass and the full `model.predict(...)`
pipeline, calling `cache_flush()` between runs. Both `--channels-last` and
`--fuse` are opt-in; nothing is enabled by default, so existing workflows
are unaffected. No hardcoded paths; every input is an argparse option.
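
A minimal sketch of how such an opt-in CLI and timing loop could be structured. The four flags match the commands above; `--warmup`, `--runs`, and the `time_ms` helper are hypothetical additions for illustration and are not taken from the script.

```python
import argparse
import statistics
import time
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="CPU inference benchmark (sketch)")
    p.add_argument("--model", required=True, help="path to the .pt checkpoint")
    p.add_argument("--image", required=True, help="path to the input image")
    # Both optimizations stay off unless explicitly requested.
    p.add_argument("--channels-last", action="store_true",
                   help="run in NHWC memory format")
    p.add_argument("--fuse", action="store_true",
                   help="apply recursive fuse_custom()")
    # Hypothetical knobs, not taken from the PR:
    p.add_argument("--warmup", type=int, default=3)
    p.add_argument("--runs", type=int, default=10)
    return p


def time_ms(fn: Callable[[], object], warmup: int, runs: int) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up iterations are not timed
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Parsing an explicit argv list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--model", "model.pt", "--image", "img.png"])
```

Using `action="store_true"` is what makes both toggles default to off, so a bare invocation measures the unmodified model.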

Why

DocLayout-YOLO inherits a `model.fuse()` entry point from upstream
Ultralytics, but it only handles `Conv` and `Conv2` and skips most of the
block modules (Bottleneck, C2f, C2fCIB, C2fAttn, G2L_CRM, etc.). When you
load a model and call `model.fuse()`, the inner blocks don't actually fuse,
so the speedup is much smaller than it could be. This PR adds the
missing `fuse_custom()` methods so that `recursive_fuse()` from the
benchmark can reach the whole network.
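
The recursive walk can be sketched as below. This is an assumption about the shape of `recursive_fuse()`, not the benchmark's actual implementation, and `Leaf` is a hypothetical stand-in for any block that exposes `fuse_custom()`.

```python
import torch.nn as nn


class Leaf(nn.Module):
    """Hypothetical block exposing the fuse_custom() protocol."""

    def __init__(self):
        super().__init__()
        self.fused = False

    def fuse_custom(self):
        self.fused = True  # a real block would fold its Conv+BN pairs here
        return self


def recursive_fuse(model: nn.Module) -> int:
    """Call fuse_custom() on every unfused submodule; return the call count."""
    count = 0
    # Snapshot the module list first: fuse_custom() may replace or delete
    # submodules, which is unsafe while the modules() generator is running.
    for m in list(model.modules()):
        fn = getattr(m, "fuse_custom", None)
        if callable(fn) and not getattr(m, "fused", False):
            fn()
            count += 1
    return count


net = nn.Sequential(Leaf(), nn.Sequential(Leaf(), nn.ReLU()), Leaf())
first = recursive_fuse(net)   # fuses all three nested blocks
second = recursive_fuse(net)  # nothing left to fuse
```

Modules without a `fuse_custom()` method (here `nn.Sequential` and `nn.ReLU`) are simply skipped, so the walk is safe on networks that are only partially covered.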

The autobackend fix and the `__init__.py` `CIB` export are prerequisites
that this work surfaced. Both were real blockers: the first when loading
fused checkpoints, the second when running `from ... import CIB` in custom
code.

Benchmark results

Run on CPU with the DocLayout-YOLO DocStructBench model (imgsz=1024,
conf=0.2, default PyTorch thread count):

| Configuration | Raw forward (ms) | Speedup |
| --- | --- | --- |
| Baseline (YOLOv10 default load, already fused) | ~3274 | 1.00× |
| + `channels_last` (NHWC) | ~2387 | 1.37× |
| + `fuse_custom` recursive + `channels_last` | ~2260–2770 | ~1.18–1.45× |

Reproduce with:

python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg
python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg --channels-last
python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg --channels-last --fuse

Both toggles are opt-in, so nothing changes for users who don't pass
the flags. The `fuse_custom` row is a range because the G2L_CRM /
DilatedBlock / C2fCIB paths are not all fully fused under `--fuse` yet
(worth a follow-up PR to close the remaining gap), but the channel-format
change alone gives the bulk of the win.
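
For reference, the speedup column is the baseline latency divided by each configuration's latency. The sketch below recomputes the ratios from the approximate latencies in the table; because those latencies are rounded, small differences from figures quoted elsewhere are expected.

```python
baseline_ms = 3274.0  # baseline raw forward latency from the table


def speedup(latency_ms: float) -> float:
    """Speedup relative to the baseline: higher is faster."""
    return baseline_ms / latency_ms


channels_last = round(speedup(2387.0), 2)  # channels_last only
fuse_fast = round(speedup(2260.0), 2)      # best fuse_custom + channels_last run
fuse_slow = round(speedup(2770.0), 2)      # worst fuse_custom + channels_last run
```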

Backwards compatibility

- No public API changes. New `fuse_custom()` methods are additive.
- The new `threads:` field in `default.yaml` is optional.
- The benchmark script is a new file and doesn't affect imports.
- The autobackend change is a strict relaxation: more load cases now
  succeed, and no previously working load case is broken.

Checklist

- All commits build locally (`pip install -e .`)
- `model.predict(...)` produces identical detections before and after fusing
- No changes to training loop, loss, or model architecture
- Benchmark script has no hardcoded paths
- Follow-up: extend `recursive_fuse` to handle the remaining
  G2L_CRM / C2fCIB / RepVGG paths that still allocate BN slots
  (see code comments in `g2l_crm.py` and `block.py`)

PyTorch forbids setting requires_grad=False on non-leaf tensors. When fuse=True creates non-leaf parameters (e.g. from fused model checkpoints), the old code crashed with: RuntimeError: you can only change requires_grad flags of leaf variables. Fix: only set requires_grad=False on leaf tensors; for non-leaf tensors, detach via .data.detach() instead.

- Add fuse_custom() to 20+ block modules (SPP, SPPF, C1-C3, C2f, Bottleneck, BottleneckCSP, ResNetBlock, RepC3, RepCSP, RepNCSPELAN4, ADown, SPPELAN, HGStem, HGBlock, Proto, GhostBottleneck, RepBottleneck, C2fAttn, Attention, PSA, SCDown, CIB, C2fCIB, G2L_CRM, DilatedBlock, DilatedBottleneck)
- Optimize CBFuse: use sum() instead of torch.sum(torch.stack())
- Optimize DilatedBlock: handle fused checkpoints (nn.Identity bn)
- Fix RepVGGDW.forward to support fused state (conv1 deleted)
- Fix MaxSigmoidAttnBlock.ec type (was Conv, should be int)
- Add threads config field to default.yaml for CPU inference tuning

benchmark_cpu_inference.py tests --channels-last and --fuse as opt-in toggles, measuring raw forward and full predict pipeline. No hardcoded paths; uses argparse for model/image paths.

ayaanmustafa added 3 commits June 11, 2026 03:09

chore(benchmarks): add CPU inference benchmark with opt-in toggles

12a89e9

ayaanmustafa wants to merge 3 commits into opendatalab:main from ayaanmustafa:cpu-inference-optimizations
