doc: convergence engine #846

Draft

yoks wants to merge 9 commits into NVIDIA:main from yoks:state-rfc

Conversation

@yoks
Contributor

@yoks yoks commented Apr 8, 2026

This PR adds a new design for the Convergence Engine. It is intended to replace the current FSM-based engine in core.

The design consists of several docs; the core one is https://github.com/yoks/bare-metal-manager-core/tree/state-rfc/book/src/design/convergence-engine

plus a separate doc for each state handler (as of right now).

The design itself is quite breaking, so I expect heavy discussion. I think the best place for it is under this PR.

I used AI to rephrase/spellcheck and format the document, but after going over it a couple of times, it reads better than it would in my own words. If someone with really good English skills can improve/edit it further, that would be great as well.

yoks added 3 commits April 7, 2026 16:17
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@yoks yoks requested a review from a team as a code owner April 8, 2026 01:10
@yoks yoks changed the title doc: convergent engine doc: convergence engine Apr 8, 2026
@github-actions

github-actions bot commented Apr 8, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-08 01:13:18 UTC | Commit: 3441946

yoks added 2 commits April 8, 2026 08:59
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@kensimon
Contributor

kensimon commented Apr 8, 2026

Thanks for the proposal, I think there are a lot of really good ideas in here. I think we should pick a particularly difficult bit of behavior in the machine state handler today and noodle on what it'd look like in the convergence engine.

Specifically, I'm thinking about the reprovision behavior, particularly the WaitingForNetworkConfig state, which today means "wait for every DPU to check in with the desired network configuration, then power off the host, wait for it to be down, mark topology update needed, then once all the DPUs report the synced state, clear the reprovision flag".

What would this look like in the new world? The docs say:

HostReprovision: Not an operation — it's a desired-state change. When an operator requests reprovisioning, S_d is updated (new firmware version, new BIOS hash, new scout version). The engine detects the deltas and converges.

But I'm not sure that's a complete answer. For one, it may not be a new BIOS hash or new firmware version, reprovisioning may happen for a variety of reasons that don't necessarily mean a change in desired state (at least not from any property we observe... maybe the machine got stuck somewhere and we simply need to start the provisioning process over.)

But even if there was, say, a desired firmware version change, how does the convergence engine ensure the operations happen in the right sequence? For instance we don't want the machine to boot while the DPUs are updating their firmware, but we do want it to reboot only after each DPU has finished the firmware update. And we don't want to initiate any of this if an instance is assigned to the host. Could you sketch out what kind of op!'s would be needed to cause the right sequence of events to happen?

I could imagine plenty of "virtual" state keys, like a "ReprovisionBarrierPhase" state, with an op that moves it to a special "all_dpus_updated" state when a special guard runs saying all_dpus(eq(FirmwareVersion, desired(FirmwareVersion))) or something, but that's hand-wavey pseudo-code that I'm not sure is actually expressible here...

yoks added 3 commits April 8, 2026 09:01
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@kensimon kensimon marked this pull request as draft April 8, 2026 16:25
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kensimon
Contributor

kensimon commented Apr 8, 2026

Note: I converted this to a draft so that CI won't run automatically (no need to waste CI cycles on a documentation PR.)

@yoks
Contributor Author

yoks commented Apr 8, 2026

There are two things here: first, how we can do something when there is no delta; and second, how to orchestrate complex operations.

If we truly could not express something in terms of an observable state change, then it does not belong here. I can think of examples such as running some diagnostic, exporting data, etc. But when we are talking about a stuck instance or a power cycle, these do change observable state, so the answer is to model them properly. Power cycle is a really good example: we can have a key like BootGeneration, and by incrementing this key we can reboot the machine. That is useful information in itself. The same goes for reprovision: a ProvisionGeneration key, incremented to express the desired state of the next generation of machine provisioning. This is one way to think about it, and it looks quite organic: it contains useful info (how many times a machine has rebooted/been provisioned) and fits naturally in desired/observed state.

As for the second question, I think the guard algebra is sufficient to cover it. We can have keys that express whether they refer to a single DPU or to all of them:

```rust
enum Scope {
    Index(usize),
    All,
}
```

```
DpuFirmwareStaged(Index(0)) = "24.04"
DpuFirmwareStaged(Index(1)) = "24.04"
DpuFirmwareStaged(All) = "24.04"
```

So the op!'s can be formulated like this:

```rust
op!(stage_dpu_firmware {
    provides: [DpuFirmwareStaged(All)],
    guard: and(
        eq(PowerState, "on"),
        neq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
        eq(InstanceAssigned, "false"),
    ),
    locks: [DpuFirmware(All)],
    effects: [DpuFirmwareStaged(All) => desired(DpuFirmwareVersion)],
    steps: [action(dpu_flash_firmware, dpu_index = All)],
    ...
});

op!(activate_dpu_firmware {
    provides: [DpuFirmwareVersion],
    guard: and(
        eq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
        neq(DpuFirmwareVersion, desired(DpuFirmwareVersion)),
        eq(InstanceAssigned, "false"),
    ),
    locks: [Power, DpuFirmware],
    effects: [DpuFirmwareVersion => desired(DpuFirmwareVersion)],
    steps: [
        action(redfish_power_cycle),
        action(wait_for_host_and_dpus_ready),
    ],
    ...
});
```

It does introduce a new key, DpuFirmwareStaged, but it still allows the engine to execute the ops and converge on it.
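As a sanity check that "run any op whose guard passes" yields the right sequence, here is a tiny self-contained simulation (a sketch with made-up key strings, not the engine's real API). Even though activate_dpu_firmware is considered first, its guard keeps failing until staging has completed, so staging necessarily runs before activation:

```rust
use std::collections::HashMap;

// Observed state as flat string keys; "desired(...)" entries stand in for the
// desired-state side. All names here are illustrative, not the real schema.
type State = HashMap<&'static str, String>;

struct Op {
    name: &'static str,
    guard: fn(&State) -> bool,
    effect: fn(&mut State),
}

// Repeatedly run the first op whose guard passes until none do (converged).
fn converge(state: &mut State, ops: &[Op]) -> Vec<&'static str> {
    let mut ran = Vec::new();
    loop {
        match ops.iter().find(|op| (op.guard)(state)) {
            Some(op) => {
                (op.effect)(state);
                ran.push(op.name);
            }
            None => return ran, // no guard passes: converged
        }
    }
}

fn main() {
    let mut state: State = HashMap::from([
        ("DpuFirmwareStaged(All)", "23.10".to_string()),
        ("DpuFirmwareVersion", "23.10".to_string()),
        ("desired(DpuFirmwareVersion)", "24.04".to_string()),
        ("InstanceAssigned", "false".to_string()),
        ("PowerState", "on".to_string()),
    ]);
    let ops = [
        Op {
            name: "activate_dpu_firmware",
            guard: |s| {
                s["DpuFirmwareStaged(All)"] == s["desired(DpuFirmwareVersion)"]
                    && s["DpuFirmwareVersion"] != s["desired(DpuFirmwareVersion)"]
                    && s["InstanceAssigned"] == "false"
            },
            effect: |s| {
                let v = s["desired(DpuFirmwareVersion)"].clone();
                s.insert("DpuFirmwareVersion", v);
            },
        },
        Op {
            name: "stage_dpu_firmware",
            guard: |s| {
                s["PowerState"] == "on"
                    && s["DpuFirmwareStaged(All)"] != s["desired(DpuFirmwareVersion)"]
                    && s["InstanceAssigned"] == "false"
            },
            effect: |s| {
                let v = s["desired(DpuFirmwareVersion)"].clone();
                s.insert("DpuFirmwareStaged(All)", v);
            },
        },
    ];
    let order = converge(&mut state, &ops);
    // activate is listed first, but its guard fails until staging is done,
    // so the guards alone enforce stage -> activate.
    assert_eq!(order, vec!["stage_dpu_firmware", "activate_dpu_firmware"]);
}
```

The point is that the ordering kensimon asks about falls out of the guard algebra rather than an explicit sequence: each op's guard encodes its preconditions, and the engine only ever runs ops whose preconditions hold.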

Signed-off-by: ianisimov <ianisimov@nvidia.com>