doc: convergence engine #846

Draft

yoks wants to merge 9 commits into NVIDIA:main from yoks:state-rfc

Conversation

@yoks
Contributor

@yoks yoks commented Apr 8, 2026

This PR adds a new design for the Convergence Engine. It is intended to replace the current FSM-based engine in core.

The design consists of several docs; the core one is https://github.com/yoks/bare-metal-manager-core/tree/state-rfc/book/src/design/convergence-engine

plus a separate doc for each state handler (as of right now).

The design itself is quite breaking, so I expect heavy discussion. I think the best place for it is under this PR.

I used AI to rephrase/spellcheck and format the document, but after going over it a couple of times, it reads better than it would in my own words. If someone with really good English skills can improve/edit it further, that would be great as well.

yoks added 3 commits April 7, 2026 16:17
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@yoks yoks requested a review from a team as a code owner April 8, 2026 01:10
@yoks yoks changed the title doc: convergent engine doc: convergence engine Apr 8, 2026
@github-actions

github-actions bot commented Apr 8, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-08 01:13:18 UTC | Commit: 3441946

yoks added 2 commits April 8, 2026 08:59
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@kensimon
Contributor

kensimon commented Apr 8, 2026

Thanks for the proposal, I think there are a lot of really good ideas in here. I think we should pick a particularly difficult bit of behavior in the machine state handler today and noodle on what it'd look like in the convergence engine.

Specifically, I'm thinking about the reprovision behavior, particularly the WaitingForNetworkConfig state, which today means "wait for every DPU to check in with the desired network configuration, then power off the host, wait for it to be down, mark topology update needed, then once all the DPUs report the synced state, clear the reprovision flag".

What would this look like in the new world? The docs say:

HostReprovision: Not an operation — it's a desired-state change. When an operator requests reprovisioning, S_d is updated (new firmware version, new BIOS hash, new scout version). The engine detects the deltas and converges.

But I'm not sure that's a complete answer. For one, it may not be a new BIOS hash or new firmware version, reprovisioning may happen for a variety of reasons that don't necessarily mean a change in desired state (at least not from any property we observe... maybe the machine got stuck somewhere and we simply need to start the provisioning process over.)

But even if there was, say, a desired firmware version change, how does the convergence engine ensure the operations happen in the right sequence? For instance we don't want the machine to boot while the DPUs are updating their firmware, but we do want it to reboot only after each DPU has finished the firmware update. And we don't want to initiate any of this if an instance is assigned to the host. Could you sketch out what kind of op!'s would be needed to cause the right sequence of events to happen?

I could imagine plenty of "virtual" state keys, like a "ReprovisionBarrierPhase" state, with an op that moves it to a special "all_dpus_updated" state when a special guard runs saying all_dpus(eq(FirmwareVersion, desired(FirmwareVersion))) or something, but that's hand-wavey pseudo-code that I'm not sure is actually expressible here...

yoks added 3 commits April 8, 2026 09:01
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
@kensimon kensimon marked this pull request as draft April 8, 2026 16:25
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kensimon
Contributor

kensimon commented Apr 8, 2026

Note: I converted this to a draft so that CI won't run automatically (no need to waste CI cycles on a documentation PR.)

@yoks
Contributor Author

yoks commented Apr 8, 2026

There are two things here: first, how we can do something when there is no delta; and second, how to orchestrate complex operations.

If we truly could not express something in terms of an observable state change, then it does not belong here. I can think of examples such as running some diagnostic, exporting data, etc. But when we are talking about a stuck instance or a power cycle, these do change observable state, so the answer is to model them properly. Power cycle is a really good example: we can have a key like BootGeneration, and by incrementing this key we can reboot the machine. That is useful information in itself. The same goes for reprovision: a ProvisionGeneration key, incremented to express the desired state of the next generation of machine provisioning. This is one way to think about it, and it looks quite organic: it contains useful info (how many times a machine has rebooted/been provisioned) and fits naturally in desired/observed state.

As for the second question, I think the guard algebra is sufficient to cover it. We can have keys that express whether they refer to a single DPU or to all of them:

```rust
enum Scope {
    Index(usize),
    All,
}
```

```
DpuFirmwareStaged(Index(0)) = "24.04"
DpuFirmwareStaged(Index(1)) = "24.04"
DpuFirmwareStaged(All) = "24.04"
```

So the op!'s can be formulated like this:

```rust
op!(stage_dpu_firmware {
    provides: [DpuFirmwareStaged(All)],
    guard: and(
        eq(PowerState, "on"),
        neq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
        eq(InstanceAssigned, "false"),
    ),
    locks: [DpuFirmware(All)],
    effects: [DpuFirmwareStaged(All) => desired(DpuFirmwareVersion)],
    steps: [action(dpu_flash_firmware, dpu_index = All)],
    ...
});

op!(activate_dpu_firmware {
    provides: [DpuFirmwareVersion],
    guard: and(
        eq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
        neq(DpuFirmwareVersion, desired(DpuFirmwareVersion)),
        eq(InstanceAssigned, "false"),
    ),
    locks: [Power, DpuFirmware],
    effects: [DpuFirmwareVersion => desired(DpuFirmwareVersion)],
    steps: [
        action(redfish_power_cycle),
        action(wait_for_host_and_dpus_ready),
    ],
    ...
});
```

It does introduce a new key, DpuFirmwareStaged, but it still allows the engine to execute the ops and converge on it.
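As a sanity check that "run any op whose guard passes" yields the right sequence, here is a tiny self-contained simulation (a sketch with made-up key strings, not the engine's real API). Even though activate_dpu_firmware is considered first, its guard keeps failing until staging has completed, so staging necessarily runs before activation:

```rust
use std::collections::HashMap;

// Observed state as flat string keys; "desired(...)" entries stand in for the
// desired-state side. All names here are illustrative, not the real schema.
type State = HashMap<&'static str, String>;

struct Op {
    name: &'static str,
    guard: fn(&State) -> bool,
    effect: fn(&mut State),
}

// Repeatedly run the first op whose guard passes until none do (converged).
fn converge(state: &mut State, ops: &[Op]) -> Vec<&'static str> {
    let mut ran = Vec::new();
    loop {
        match ops.iter().find(|op| (op.guard)(state)) {
            Some(op) => {
                (op.effect)(state);
                ran.push(op.name);
            }
            None => return ran, // no guard passes: converged
        }
    }
}

fn main() {
    let mut state: State = HashMap::from([
        ("DpuFirmwareStaged(All)", "23.10".to_string()),
        ("DpuFirmwareVersion", "23.10".to_string()),
        ("desired(DpuFirmwareVersion)", "24.04".to_string()),
        ("InstanceAssigned", "false".to_string()),
        ("PowerState", "on".to_string()),
    ]);
    let ops = [
        Op {
            name: "activate_dpu_firmware",
            guard: |s| {
                s["DpuFirmwareStaged(All)"] == s["desired(DpuFirmwareVersion)"]
                    && s["DpuFirmwareVersion"] != s["desired(DpuFirmwareVersion)"]
                    && s["InstanceAssigned"] == "false"
            },
            effect: |s| {
                let v = s["desired(DpuFirmwareVersion)"].clone();
                s.insert("DpuFirmwareVersion", v);
            },
        },
        Op {
            name: "stage_dpu_firmware",
            guard: |s| {
                s["PowerState"] == "on"
                    && s["DpuFirmwareStaged(All)"] != s["desired(DpuFirmwareVersion)"]
                    && s["InstanceAssigned"] == "false"
            },
            effect: |s| {
                let v = s["desired(DpuFirmwareVersion)"].clone();
                s.insert("DpuFirmwareStaged(All)", v);
            },
        },
    ];
    let order = converge(&mut state, &ops);
    // activate is listed first, but its guard fails until staging is done,
    // so the guards alone enforce stage -> activate.
    assert_eq!(order, vec!["stage_dpu_firmware", "activate_dpu_firmware"]);
}
```

The point is that the ordering kensimon asks about falls out of the guard algebra rather than an explicit sequence: each op's guard encodes its preconditions, and the engine only ever runs ops whose preconditions hold.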

Signed-off-by: ianisimov <ianisimov@nvidia.com>