Conversation
Signed-off-by: ianisimov <ianisimov@nvidia.com>
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-04-08 01:13:18 UTC | Commit: 3441946
Thanks for the proposal; I think there are a lot of really good ideas in here. I think we should pick a particularly difficult bit of behavior in today's machine state handler and noodle on what it would look like in the convergence engine. I'm thinking in particular of the reprovision behavior, specifically the WaitingForNetworkConfig state, which today means: wait for every DPU to check in with the desired network configuration, then power off the host, wait for it to be down, mark a topology update as needed, and then, once all the DPUs report the synced state, clear the reprovision flag. What would this look like in the new world? The docs say:
But I'm not sure that's a complete answer. For one, it may not be a new BIOS hash or a new firmware version: reprovisioning can happen for a variety of reasons that don't necessarily correspond to a change in desired state (at least not in any property we observe; maybe the machine got stuck somewhere and we simply need to start the provisioning process over). But even if there were, say, a desired firmware version change, how does the convergence engine ensure the operations happen in the right sequence? For instance, we don't want the machine to boot while the DPUs are updating their firmware, but we do want it to reboot once each DPU has finished the firmware update. And we don't want to start any of this if an instance is assigned to the host. Could you sketch out what that would look like? I could imagine plenty of "virtual" state keys, like a "ReprovisionBarrierPhase" state, with an op that moves it to a special "all_dpus_updated" state when a special guard passes.
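A rough sketch of the kind of virtual-key modelling hinted at above. Everything here is hypothetical: the `ReprovisionBarrierPhase` / `all_dpus_updated` names come from the question itself, and the types and functions are illustrative, not part of the actual convergence engine.

```rust
// Illustrative sketch only: modelling the WaitingForNetworkConfig barrier as a
// "virtual" ReprovisionBarrierPhase key. All names and types are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
enum DpuSync {
    Pending, // DPU has not yet reported the desired network config
    Synced,  // DPU reports the synced state
}

/// Guard: passes only once every DPU has reported the synced state.
fn all_dpus_synced(dpus: &[DpuSync]) -> bool {
    !dpus.is_empty() && dpus.iter().all(|d| *d == DpuSync::Synced)
}

/// Virtual-key transition: the op that owns ReprovisionBarrierPhase moves it
/// to "all_dpus_updated" when the guard passes; otherwise the phase holds.
fn next_barrier_phase(phase: &'static str, dpus: &[DpuSync]) -> &'static str {
    if phase == "waiting_for_network_config" && all_dpus_synced(dpus) {
        "all_dpus_updated"
    } else {
        phase
    }
}
```

The point of the sketch is only that the barrier becomes an ordinary guarded key transition, which the engine can re-evaluate on every convergence pass.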
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Note: I converted this to a draft so that CI won't run automatically (no need to waste CI cycles on a documentation PR).
There are two things here. First: if we truly could not express something in terms of an observable state change, it does not belong here; I can think of examples such as running some diagnostic, exporting data, etc. But when we are talking about a stuck instance or a power cycle, that does change observable state, so the answer is to model it properly. A plain power cycle is a really good example: we can have a key like BootGeneration, and by updating this key we can reboot the machine. This is useful information in itself. The same goes for reprovision: we can have a ProvisionGeneration key, and by incrementing it we can express the desired state of the next generation of machine provisioning. This is one way to think about it; it looks quite organic, as it contains useful info (how many times the machine rebooted/provisioned) and clearly belongs in desired/observed state. As for the second question, I think the guard algebra is sufficient to cover it. We can have keys that express whether a single DPU or all DPUs are meant, like this:

enum Scope {
Index(usize),
All,
}
DpuFirmwareStaged(Index(0)) = "24.04"
DpuFirmwareStaged(Index(1)) = "24.04"
DpuFirmwareStaged(All) = "24.04"

So:

op!(stage_dpu_firmware {
provides: [DpuFirmwareStaged(All)],
guard: and(
eq(PowerState, "on"),
neq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
eq(InstanceAssigned, "false"),
),
locks: [DpuFirmware(All)],
effects: [DpuFirmwareStaged(All) => desired(DpuFirmwareVersion)],
steps: [action(dpu_flash_firmware, dpu_index = All)],
...
});
op!(activate_dpu_firmware {
provides: [DpuFirmwareVersion],
guard: and(
eq(DpuFirmwareStaged(All), desired(DpuFirmwareVersion)),
neq(DpuFirmwareVersion, desired(DpuFirmwareVersion)),
eq(InstanceAssigned, "false"),
),
locks: [Power, DpuFirmware],
effects: [DpuFirmwareVersion => desired(DpuFirmwareVersion)],
steps: [
action(redfish_power_cycle),
action(wait_for_host_and_dpus_ready),
],
...
});

It does introduce a new key
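The generation-key idea above can also be sketched as a toy model. This is illustrative only, not the proposed API: `MachineState`, `next_op`, and the op names are made up for the sketch; the real engine would express the same diff through its own keys, guards, and ops.

```rust
// Toy model of the generation-key idea: bumping a desired BootGeneration /
// ProvisionGeneration expresses "reboot / reprovision again" as an ordinary
// observable state diff. All names here are illustrative, not the real API.

#[derive(Default, Clone, Copy, PartialEq, Debug)]
struct MachineState {
    boot_generation: u64,
    provision_generation: u64,
}

/// One convergence step: return the op (if any) whose guard passes.
/// Reprovision takes priority here, since it implies a reboot anyway.
fn next_op(desired: MachineState, observed: MachineState) -> Option<&'static str> {
    if observed.provision_generation < desired.provision_generation {
        Some("reprovision") // start provisioning over, even with no other change
    } else if observed.boot_generation < desired.boot_generation {
        Some("power_cycle") // a plain reboot, requested via the generation bump
    } else {
        None // converged
    }
}
```

Incrementing only `desired.boot_generation` makes the engine pick `power_cycle`; incrementing `desired.provision_generation` makes it pick `reprovision` even when no other observed property differs, which is exactly the "machine got stuck, start over" case.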
This PR adds a new design for the Convergence Engine. It is intended to replace the current FSM-based engine in core.
The design consists of several docs; the core one is https://github.com/yoks/bare-metal-manager-core/tree/state-rfc/book/src/design/convergence-engine
plus multiple docs, one for each state handler (as of right now).
The design itself is quite breaking, so I expect it will need heavy discussion; I think the best place for that is under this PR.
I used AI to rephrase/spellcheck the document and to format it, but after going over it a couple of times, it reads better than it would in my own words. If someone with really good English skills can improve/edit it, that would be great as well.