Separate concerns for compute-pcrs#294
Conversation
If another process had written, replace will fail. Instead, loop and retry on conflict. Add comments on "if this becomes a performance issue". Signed-off-by: Jakob Naucke <jnaucke@redhat.com> Assisted-by: Opus 4.6
and turn into a function, no longer needed in compute-pcrs. Signed-off-by: Jakob Naucke <jnaucke@redhat.com> Assisted-by: Opus 4.6
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Jakob-Naucke The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Reviewer's GuideRefactors status/condition helpers and image PCRs update logic to live in the operator crate, replaces macros with regular functions, and changes compute-pcrs so it only updates the ConfigMap with retry-on-conflict while the operator is responsible for status updates on ApprovedImage and cluster resources. Sequence diagram for compute-pcrs ConfigMap update with retry-on-conflictsequenceDiagram
actor JobPod
participant ComputePcrs as compute_pcrs_main
participant KubeAPI as Kube_API
participant ImagePcrsCM as ConfigMap_image_pcrs
JobPod->>ComputePcrs: main()
loop until ConfigMap replace succeeds
ComputePcrs->>KubeAPI: ConfigMap::get(PCR_CONFIG_MAP)
KubeAPI-->>ComputePcrs: image_pcrs_map
ComputePcrs->>ComputePcrs: serde_json::from_str(ImagePcrs)
ComputePcrs->>ComputePcrs: image_pcrs.0.insert(resource_name, ImagePcr)
ComputePcrs->>KubeAPI: ConfigMap::replace(PCR_CONFIG_MAP, image_pcrs_map)
alt conflict 409
KubeAPI-->>ComputePcrs: Error Api{code:409}
ComputePcrs->>ComputePcrs: retry
else success
KubeAPI-->>ComputePcrs: Ok
ComputePcrs->>JobPod: Ok(())
end
end
Sequence diagram for operator job_reconcile updating ApprovedImage statussequenceDiagram
participant Operator as job_reconcile
participant KubeAPI as Kube_API
participant ApprovedImage as ApprovedImage
participant ImagePcrsCM as ConfigMap_image_pcrs
participant Trustee as trustee_update_reference_values
Operator->>KubeAPI: Job::status
KubeAPI-->>Operator: Job{status:Complete}
Operator->>KubeAPI: Job::metadata.owner_references
KubeAPI-->>Operator: OwnerReference{name}
Operator->>KubeAPI: Api<ApprovedImage>::get(name)
KubeAPI-->>Operator: ApprovedImage
Operator->>Operator: committed_condition(INSTALLED_REASON, generation, None)
Operator->>Operator: ApprovedImageStatus{conditions}
Operator->>KubeAPI: update_status(Api<ApprovedImage>, name, status)
KubeAPI-->>Operator: Ok(())
Operator->>KubeAPI: Api<ConfigMap>::get(PCR_CONFIG_MAP)
KubeAPI-->>Operator: image_pcrs_map
Operator->>Operator: update_image_pcrs(config_maps, map, ImagePcrs)
Operator->>KubeAPI: ConfigMap::replace(PCR_CONFIG_MAP, map)
KubeAPI-->>Operator: Ok(())
Operator->>Trustee: update_reference_values(Client)
Trustee-->>Operator: Ok(())
Operator->>KubeAPI: Job::delete(name, DeleteParams::foreground)
KubeAPI-->>Operator: Ok(())
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The retry loop in
compute-pcrsfor handling 409 conflicts on the ConfigMap update has no backoff or max retry limit; consider adding a small delay and/or attempt cap to avoid a tight infinite loop if the conflict persists. - In
update_status, the generic bounds onKare quite broad andDebugis not used; you may be able to simplify the signature (and potentially move the constraint to theApi<K>caller) to make the helper more generally reusable and easier to read.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The retry loop in `compute-pcrs` for handling 409 conflicts on the ConfigMap update has no backoff or max retry limit; consider adding a small delay and/or attempt cap to avoid a tight infinite loop if the conflict persists.
- In `update_status`, the generic bounds on `K` are quite broad and `Debug` is not used; you may be able to simplify the signature (and potentially move the constraint to the `Api<K>` caller) to make the helper more generally reusable and easier to read.
## Individual Comments
### Comment 1
<location path="operator/src/reference_values.rs" line_range="129-130" />
<code_context>
return Ok(Action::requeue(Duration::from_secs(300)));
}
+
+ let err = "Completed job had no owner reference";
+ let resource_name = &job.meta().owner_references.as_ref().context(err)?[0].name;
+ let approved_images: Api<ApprovedImage> = Api::default_namespaced(kube_client.clone());
+ let try_image = approved_images.get(resource_name).await;
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing the first owner reference by index can panic if the vector is empty.
`context(err)?` only guards the `Option`; it doesn’t prevent `owner_references` from being `Some(vec![])`, which will panic on `[0]`. Use something like `and_then(|v| v.first())` to ensure a non-empty vector while still surfacing the error via `Context`, e.g.:
```rust
let owner = job
.meta()
.owner_references
.as_ref()
.and_then(|v| v.first())
.context(err)?;
let resource_name = &owner.name;
```
</issue_to_address>
### Comment 2
<location path="compute-pcrs/src/main.rs" line_range="66-75" />
<code_context>
+ // If we see this causing performance problems, consider NoSQL
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The tight retry loop on ConfigMap conflicts lacks backoff or a retry limit.
Under high contention this can devolve into a busy-wait and generate excessive API calls. Add a brief backoff (e.g., `tokio::time::sleep`) and/or a maximum retry count so that persistent conflicts fail with an error instead of looping indefinitely.
Suggested implementation:
```rust
// If we see this causing performance problems, consider NoSQL
const MAX_PCR_CONFIGMAP_RETRIES: u32 = 10;
let mut pcr_configmap_retries: u32 = 0;
loop {
if pcr_configmap_retries >= MAX_PCR_CONFIGMAP_RETRIES {
anyhow::bail!(
"failed to update PCR ConfigMap after {MAX_PCR_CONFIGMAP_RETRIES} retries due to conflicts"
);
}
// simple linear backoff to avoid busy-waiting under contention
if pcr_configmap_retries > 0 {
// scale by attempt to slightly increase delay on repeated conflicts
sleep(Duration::from_millis(50 * pcr_configmap_retries as u64)).await;
}
pcr_configmap_retries += 1;
let mut image_pcrs_map = config_maps.get(PCR_CONFIG_MAP).await?;
```
To compile and run correctly, you will also need to:
1. Import the backoff primitives near the top of `compute-pcrs/src/main.rs`:
- `use std::time::Duration;`
- `use tokio::time::sleep;`
2. Ensure that the loop breaks out on a successful ConfigMap update (e.g., after a successful `config_maps.replace(...)` call). The retry counter and backoff only help if the code:
- `continue`s the loop specifically on conflict (HTTP 409) from the Kubernetes API, and
- `break`s or returns on success or non-retryable errors.
If the conflict handling logic is not yet present, you should:
- Match on the `kube::Error` from the update call,
- `continue` the loop only when the error is a 409 conflict,
- Return the error immediately for all other failure modes.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
ee4443f to
69932a0
Compare
not from compute-pcrs job for separation of concerns. Signed-off-by: Jakob Naucke <jnaucke@redhat.com> Assisted-by: Opus 4.6
69932a0 to
d0441b1
Compare
no longer needed, makes lib/ slimmer Signed-off-by: Jakob Naucke <jnaucke@redhat.com> Assisted-by: Opus 4.6
d0441b1 to
3409d7a
Compare
| let status = ApprovedImageStatus { conditions }; | ||
| update_status!(approved_images, &args.resource_name, status)?; | ||
| // If we see this causing performance problems, consider NoSQL | ||
| loop { |
There was a problem hiding this comment.
suggestion from @alicefr: instead of "managing" concurrent writes from the job, write back into an extra status field of the ApprovedImage and let the operator collect
|
@Jakob-Naucke: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
related #269
Summary by Sourcery
Separate status management from PCR computation and make ConfigMap updates for image PCRs more robust.
New Features:
Bug Fixes:
Enhancements: