CPU inference optimizations: fuse_custom() across all blocks + channels_last benchmark #201

Open
ayaanmustafa wants to merge 3 commits into opendatalab:main from ayaanmustafa:cpu-inference-optimizations

These three commits add a recursive fuse_custom() path to DocLayout-YOLO's
inference stack and ship a CPU benchmark to validate the gains. No model
architecture, training loop, or detection behavior is changed: the speedup
comes from fusing Conv+BN pairs (and similar layer combinations) and from
letting the benchmark switch to the NHWC (channels_last) memory format.

What's in this PR

1. fix(autobackend): handle non-leaf parameters in fused checkpoints

  • doclayout_yolo/nn/autobackend.py — when loading a fused checkpoint,
    the loader tried to flip requires_grad=False on parameters that
    weren't leaf tensors and crashed with:
    RuntimeError: you can only change requires_grad flags of leaf variables.
    The loader now sets requires_grad only on leaf tensors; non-leaf
    tensors are detached via .data.detach() instead. This lets
    YOLOv10("model.pt") load a fused checkpoint end-to-end.
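
The leaf/non-leaf distinction can be illustrated with a minimal sketch. This is not the actual autobackend.py code: the helper name freeze_for_inference is hypothetical, and re-registering a detached copy as a new parameter is only one assumed way of applying the .data.detach() idea described above.

```python
import torch
import torch.nn as nn


def freeze_for_inference(model: nn.Module) -> None:
    """Disable gradients on every parameter, tolerating non-leaf tensors."""
    for module in model.modules():
        # list(...) because parameters may be re-registered while iterating.
        for name, param in list(module.named_parameters(recurse=False)):
            if param.is_leaf:
                param.requires_grad = False
            else:
                # Assumption: swap in a detached copy. Setting requires_grad
                # directly on a non-leaf tensor raises RuntimeError.
                detached = nn.Parameter(param.data.detach(), requires_grad=False)
                setattr(module, name, detached)


# Reproduce a non-leaf parameter: an in-place copy from a tensor that tracks
# gradients attaches the destination to the autograd graph.
conv = nn.Conv2d(1, 1, kernel_size=1).requires_grad_(False)
source = torch.ones_like(conv.weight).requires_grad_(True) * 2.0
conv.weight.copy_(source)

was_non_leaf = not conv.weight.is_leaf
freeze_for_inference(conv)
```

After the call, every parameter is a leaf tensor with gradients disabled, and the copied values are preserved.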

2. feat(fuse): add fuse_custom() to all block modules + CPU thread config

  • doclayout_yolo/nn/modules/block.py — adds fuse_custom() (safe,
    idempotent, guarded by a fused flag) to:
    SPP, SPPF, C1, C2, C3, C2f, Bottleneck, BottleneckCSP, ResNetBlock,
    RepC3, RepCSP, RepNCSPELAN4, ADown, SPPELAN, HGStem, HGBlock, Proto,
    GhostBottleneck, RepBottleneck, C2fAttn, Attention, PSA, SCDown, CIB,
    C2fCIB.
  • doclayout_yolo/nn/modules/g2l_crm.py — adds fuse_custom() to
    DilatedBlock, DilatedBottleneck, and the top-level G2L_CRM module.
    DilatedBlock.dilated_conv now handles fused checkpoints where bn
    has been replaced by nn.Identity (falls back to using conv.bias).
  • doclayout_yolo/nn/modules/conv.py — supporting changes to Conv.fuse_custom.
  • doclayout_yolo/nn/modules/__init__.py now exports CIB, which was
    missing from __all__; as a result, the statement
    from doclayout_yolo.nn.modules import CIB failed in custom code.
  • doclayout_yolo/cfg/default.yaml — new threads: field for tuning
    CPU inference thread count.
  • Optimizations and fixes: CBFuse uses sum() instead of
    torch.sum(torch.stack(...)), which avoids allocating the stacked
    tensor; RepVGGDW.forward supports the fused state (when conv1 is
    deleted by fuse()); the MaxSigmoidAttnBlock.ec type is fixed (was
    Conv, should be int).
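
The CBFuse change in the last bullet can be illustrated in isolation. The sketch below is not the CBFuse module itself; fuse_stack and fuse_sum are hypothetical helper names, and the tensor shapes are arbitrary. It only shows that the two reductions agree, while the sum() form never materializes the stacked (N, ...) tensor.

```python
import torch


def fuse_stack(tensors):
    """Earlier pattern: stack into one extra (N, ...) tensor, then reduce."""
    return torch.sum(torch.stack(tensors), dim=0)


def fuse_sum(tensors):
    """Pattern described in this PR: accumulate with Python's built-in sum()."""
    return sum(tensors)


torch.manual_seed(0)
# Three same-shaped feature maps, standing in for the inputs being fused.
feature_maps = [torch.randn(1, 8, 16, 16) for _ in range(3)]

stacked_result = fuse_stack(feature_maps)
summed_result = fuse_sum(feature_maps)
```

Both results have the shape of a single feature map; they can differ by floating-point rounding because the additions happen in a different order.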

3. chore(benchmarks): add CPU inference benchmark with opt-in toggles

  • benchmarks/benchmark_cpu_inference.py — new CLI:
    python benchmark_cpu_inference.py --model model.pt --image img.png
    python benchmark_cpu_inference.py --model model.pt --image img.png --channels-last
    python benchmark_cpu_inference.py --model model.pt --image img.png --channels-last --fuse
    
    Measures the raw forward pass and the full model.predict(...)
    pipeline, calling cache_flush() between runs. Both --channels-last
    and --fuse are opt-in; nothing is enabled by default, so existing
    workflows are unaffected. No paths are hardcoded; every input is an
    argparse argument.
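
A minimal sketch of how such opt-in toggles can be declared with argparse is shown below. The four flag names come from the commands above; the help strings, the required=True setting, and the build_parser name are assumptions, not the contents of benchmark_cpu_inference.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the benchmark CLI; both optimizations default to off."""
    parser = argparse.ArgumentParser(description="CPU inference benchmark (sketch)")
    parser.add_argument("--model", required=True, help="path to the .pt checkpoint")
    parser.add_argument("--image", required=True, help="path to the input image")
    # store_true flags are False unless passed, which keeps them opt-in.
    parser.add_argument("--channels-last", action="store_true",
                        help="run the model in NHWC memory format")
    parser.add_argument("--fuse", action="store_true",
                        help="apply recursive fuse_custom() before timing")
    return parser


baseline_args = build_parser().parse_args(["--model", "model.pt", "--image", "img.png"])
tuned_args = build_parser().parse_args(
    ["--model", "model.pt", "--image", "img.png", "--channels-last", "--fuse"]
)
```

Because both flags use action="store_true", a plain invocation leaves args.channels_last and args.fuse set to False.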

Why

DocLayout-YOLO inherits the model.fuse() entry point from upstream
Ultralytics, but that method only handles Conv and Conv2 and skips most
of the block modules (Bottleneck, C2f, C2fCIB, C2fAttn, G2L_CRM, etc.).
When you load a model and call model.fuse(), the inner blocks are left
unfused, so the speedup is much smaller than it could be. This PR adds
the missing fuse_custom() methods so that recursive_fuse() from the
benchmark can reach the whole network.
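
The pattern can be sketched as follows. ConvBN is a hypothetical stand-in for the real block modules, and this recursive_fuse is an assumed shape for the benchmark's helper, not its actual code. The fusion itself uses PyTorch's torch.nn.utils.fusion.fuse_conv_bn_eval, which may differ from what Conv.fuse_custom does in this PR.

```python
import torch
import torch.nn as nn
from torch.nn.utils.fusion import fuse_conv_bn_eval


class ConvBN(nn.Module):
    """Hypothetical Conv+BN block with a guarded, idempotent fuse_custom()."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.fused = False

    def forward(self, x):
        return self.bn(self.conv(x))

    def fuse_custom(self):
        if self.fused:  # guard: a second call is a no-op
            return self
        # Fold the BN statistics into the conv weights (eval mode required).
        self.conv = fuse_conv_bn_eval(self.conv.eval(), self.bn.eval())
        self.bn = nn.Identity()
        self.fused = True
        return self


def recursive_fuse(model: nn.Module) -> nn.Module:
    """Call fuse_custom() on every submodule that defines it."""
    for module in list(model.modules()):  # snapshot: fusing swaps children
        if hasattr(module, "fuse_custom"):
            module.fuse_custom()
    return model


torch.manual_seed(0)
model = nn.Sequential(ConvBN(3, 8), ConvBN(8, 8)).eval()
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):  # non-trivial running statistics
        m.running_mean.normal_()
        m.running_var.uniform_(0.5, 1.5)

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    before = model(x)
    recursive_fuse(model)
    recursive_fuse(model)  # safe to repeat thanks to the fused flag
    after = model(x)
```

The outputs before and after fusion should match up to floating-point tolerance, which is the property the checklist item on identical detections relies on.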

The autobackend fix and the __init__.py CIB export are prerequisites
that this work surfaced: both were real blockers when loading fused
checkpoints or running from ... import CIB in custom code.

Benchmark results

Run on CPU with the DocLayout-YOLO DocStructBench model (imgsz=1024,
conf=0.2, default PyTorch thread count):

| Configuration | Raw Forward (ms) | Speedup |
|---|---|---|
| Baseline (YOLOv10 default load, already fused) | ~3274 | 1.00× |
| + channels_last (NHWC) | ~2387 | 1.37× |
| + fuse_custom recursive + channels_last | ~2260–2770 | 1.21–~1.5× |
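
The Speedup column is the baseline time divided by the configuration's time. The sketch below shows that calculation for the channels_last row, next to one possible way of timing a callable. The helper names time_ms and speedup, the warm-up count, and the use of a median are assumptions; the benchmark script may measure differently.

```python
import statistics
import time


def time_ms(fn, warmup: int = 2, runs: int = 5) -> float:
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Values above 1.0 mean the candidate is faster than the baseline."""
    return baseline_ms / candidate_ms


# channels_last row from the table: ~3274 ms baseline vs ~2387 ms.
channels_last_speedup = round(speedup(3274, 2387), 2)

# Timing a trivial stand-in workload; a real run would time model(x).
elapsed = time_ms(lambda: sum(range(10_000)))
```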

Reproduce with:

python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg
python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg --channels-last
python benchmarks/benchmark_cpu_inference.py --model doclayout_yolo_docstructbench_imgsz1024.pt --image assets/example/academic.jpg --channels-last --fuse

Both toggles are opt-in, so nothing changes for users who don't pass
the flags. The fuse_custom row is reported as a range because the
G2L_CRM, DilatedBlock, and C2fCIB paths are not yet fully fused under
--fuse (a follow-up PR should close the remaining gap); the
memory-format change alone accounts for most of the gain.
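
For reference, switching a PyTorch model and its input to NHWC typically looks like the sketch below. The small Sequential model is a hypothetical stand-in for the detector, and what --channels-last does inside the benchmark script is assumed to be similar rather than confirmed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in model; the real benchmark loads a DocLayout-YOLO checkpoint.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
x = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    reference = model(x)  # default NCHW (contiguous) layout

    # Convert the 4D weights and the input to channels_last (NHWC strides).
    model = model.to(memory_format=torch.channels_last)
    x_cl = x.contiguous(memory_format=torch.channels_last)
    result = model(x_cl)
```

Only the memory layout changes: the logical shape stays (N, C, H, W) and the outputs match the NCHW run up to floating-point tolerance.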

Backwards compatibility

  • No public API changes. New fuse_custom() methods are additive.
  • New threads: field in default.yaml is optional.
  • New benchmark script is a new file, doesn't affect imports.
  • The autobackend change is a strict relaxation (more load cases now
    succeed, no previously-working load case is broken).
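
One way an optional threads setting could be applied is sketched below. How this PR actually reads the threads: field is not shown here, so the apply_thread_config helper and its "unset means keep the PyTorch default" behavior are assumptions; torch.set_num_threads and torch.get_num_threads are the standard PyTorch calls.

```python
import torch


def apply_thread_config(threads=None) -> int:
    """Apply an optional thread count and return the effective value.

    Assumption: a missing (None) or zero value leaves PyTorch's default
    intra-op thread count untouched, which keeps the field optional.
    """
    if threads:
        torch.set_num_threads(int(threads))
    return torch.get_num_threads()


default_threads = apply_thread_config(None)  # no override: default kept
single_thread = apply_thread_config(1)       # explicit override
apply_thread_config(default_threads)         # restore the previous setting
```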

Checklist

  • All commits build locally (pip install -e .)
  • model.predict(...) produces identical detections before/after fuse
  • No changes to training loop, loss, or model architecture
  • Benchmark script has no hardcoded paths
  • Follow-up: extend recursive_fuse to handle the remaining
    G2L_CRM / C2fCIB / RepVGG paths that still allocate BN slots
    (see code comments in g2l_crm.py and block.py)

PyTorch forbids setting requires_grad=False on non-leaf tensors.
When fuse=True creates non-leaf parameters (e.g. from fused model
checkpoints), the old code crashed with:
RuntimeError: you can only change requires_grad flags of leaf variables.

Fix: only set requires_grad=False on leaf tensors; for non-leaf tensors,
detach via .data.detach() instead.

- Add fuse_custom() to 20+ block modules (SPP, SPPF, C1-C3, C2f,
  Bottleneck, BottleneckCSP, ResNetBlock, RepC3, RepCSP, RepNCSPELAN4,
  ADown, SPPELAN, HGStem, HGBlock, Proto, GhostBottleneck, RepBottleneck,
  C2fAttn, Attention, PSA, SCDown, CIB, C2fCIB, G2L_CRM, DilatedBlock,
  DilatedBottleneck)
- Optimize CBFuse: use sum() instead of torch.sum(torch.stack())
- Optimize DilatedBlock: handle fused checkpoints (nn.Identity bn)
- Fix RepVGGDW.forward to support fused state (conv1 deleted)
- Fix MaxSigmoidAttnBlock.ec type (was Conv, should be int)
- Add threads config field to default.yaml for CPU inference tuning

benchmark_cpu_inference.py tests --channels-last and --fuse as
opt-in toggles, measuring raw forward and full predict pipeline.
No hardcoded paths; uses argparse for model/image paths.