Skip to content

Separate concerns for compute-pcrs#294

Open
Jakob-Naucke wants to merge 4 commits into
trusted-execution-clusters:mainfrom
Jakob-Naucke:sep-compute-pcrs-concerns
Open

Separate concerns for compute-pcrs#294
Jakob-Naucke wants to merge 4 commits into
trusted-execution-clusters:mainfrom
Jakob-Naucke:sep-compute-pcrs-concerns

Conversation

@Jakob-Naucke

@Jakob-Naucke Jakob-Naucke commented Jun 26, 2026

Copy link
Copy Markdown
Member
  • retry image pcrs update in compute-pcrs
  • do not update status in compute-pcrs
  • refactor for things that don't need to be shared and functions that don't need to be macros

related #269

Summary by Sourcery

Separate status management from PCR computation and make ConfigMap updates for image PCRs more robust.

New Features:

  • Initialize structured logging in the compute-pcrs binary.

Bug Fixes:

  • Retry image PCR ConfigMap updates in compute-pcrs to handle update conflicts.
  • Ensure completed PCR compute jobs update ApprovedImage status conditions in the operator rather than in the compute-pcrs binary.

Enhancements:

  • Replace macros for status and image PCR updates with reusable async helper functions in the operator.
  • Move condition and transition-time utilities from the shared library into the operator where they are used.
  • Make ImagePcr clonable to support reuse during ConfigMap update retries.
  • Add dedicated helper and tests for updating image PCRs ConfigMaps from the operator.

If another process had written, replace will fail. Instead, loop and
retry on conflict.

Add comments on "if this becomes a performance issue".

Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Assisted-by: Opus 4.6
and turn into a function, no longer needed in compute-pcrs.

Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Assisted-by: Opus 4.6
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Jakob-Naucke

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sourcery-ai

sourcery-ai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Reviewer's Guide

Refactors status/condition helpers and image PCRs update logic to live in the operator crate, replaces macros with regular functions, and changes compute-pcrs so it only updates the ConfigMap with retry-on-conflict while the operator is responsible for status updates on ApprovedImage and cluster resources.

Sequence diagram for compute-pcrs ConfigMap update with retry-on-conflict

sequenceDiagram
    actor JobPod
    participant ComputePcrs as compute_pcrs_main
    participant KubeAPI as Kube_API
    participant ImagePcrsCM as ConfigMap_image_pcrs

    JobPod->>ComputePcrs: main()
    loop until ConfigMap replace succeeds
        ComputePcrs->>KubeAPI: ConfigMap::get(PCR_CONFIG_MAP)
        KubeAPI-->>ComputePcrs: image_pcrs_map
        ComputePcrs->>ComputePcrs: serde_json::from_str(ImagePcrs)
        ComputePcrs->>ComputePcrs: image_pcrs.0.insert(resource_name, ImagePcr)
        ComputePcrs->>KubeAPI: ConfigMap::replace(PCR_CONFIG_MAP, image_pcrs_map)
        alt conflict 409
            KubeAPI-->>ComputePcrs: Error Api{code:409}
            ComputePcrs->>ComputePcrs: retry
        else success
            KubeAPI-->>ComputePcrs: Ok
            ComputePcrs->>JobPod: Ok(())
        end
    end
Loading

Sequence diagram for operator job_reconcile updating ApprovedImage status

sequenceDiagram
    participant Operator as job_reconcile
    participant KubeAPI as Kube_API
    participant ApprovedImage as ApprovedImage
    participant ImagePcrsCM as ConfigMap_image_pcrs
    participant Trustee as trustee_update_reference_values

    Operator->>KubeAPI: Job::status
    KubeAPI-->>Operator: Job{status:Complete}
    Operator->>KubeAPI: Job::metadata.owner_references
    KubeAPI-->>Operator: OwnerReference{name}

    Operator->>KubeAPI: Api<ApprovedImage>::get(name)
    KubeAPI-->>Operator: ApprovedImage
    Operator->>Operator: committed_condition(INSTALLED_REASON, generation, None)
    Operator->>Operator: ApprovedImageStatus{conditions}
    Operator->>KubeAPI: update_status(Api<ApprovedImage>, name, status)
    KubeAPI-->>Operator: Ok(())

    Operator->>KubeAPI: Api<ConfigMap>::get(PCR_CONFIG_MAP)
    KubeAPI-->>Operator: image_pcrs_map
    Operator->>Operator: update_image_pcrs(config_maps, map, ImagePcrs)
    Operator->>KubeAPI: ConfigMap::replace(PCR_CONFIG_MAP, map)
    KubeAPI-->>Operator: Ok(())

    Operator->>Trustee: update_reference_values(Client)
    Trustee-->>Operator: Ok(())

    Operator->>KubeAPI: Job::delete(name, DeleteParams::foreground)
    KubeAPI-->>Operator: Ok(())
Loading

File-Level Changes

Change Details Files
Move status/condition helper code from the shared lib into the operator crate and replace macro-based status updates with a typed async helper.
  • Remove update_status macro, condition helpers, and trait/impls from lib.rs
  • Introduce async update_status function and condition helpers (condition_status, Conditions trait, transition_time, committed_condition) in operator/src/lib.rs
  • Wire operator main and conditions modules to use the new update_status function and locally defined helpers instead of the removed lib exports
lib/src/lib.rs
operator/src/lib.rs
operator/src/main.rs
operator/src/conditions.rs
Refactor image PCRs update logic into a reusable async function in the operator and adjust callers and tests accordingly.
  • Remove update_image_pcrs macro from lib/reference_values.rs and make ImagePcr clonable
  • Add async update_image_pcrs function in operator/reference_values.rs that serializes ImagePcrs and replaces the PCR ConfigMap
  • Update handle_new_image and disallow_image to call update_image_pcrs and return proper errors
  • Extend tests to cover update_image_pcrs success/error paths and adjust existing tests for new behavior (e.g., ownerReferences, new API call order)
lib/src/reference_values.rs
operator/src/reference_values.rs
Change compute-pcrs binary to only update image PCRs in the ConfigMap with retry-on-conflict and no longer manipulate ApprovedImage status.
  • Initialize env_logger for compute-pcrs for runtime logging
  • Fetch and update the image-pcrs ConfigMap in a retry loop, handling 409 conflict responses by re-reading and retrying the replace
  • Stop fetching ApprovedImage resources and stop updating their status conditions from compute-pcrs; only modify the ConfigMap
  • Add Clone derive to ImagePcr to support cloning when updating the map
compute-pcrs/src/main.rs
lib/src/reference_values.rs
compute-pcrs/Cargo.toml
Ensure ApprovedImage status is updated by the operator when compute-pcrs jobs complete instead of by the compute-pcrs binary itself.
  • Extend job_reconcile to resolve the owning ApprovedImage from the Job ownerReference
  • Compute a committed condition using committed_condition and update ApprovedImageStatus via update_status
  • Adapt job_reconcile test expectations to the new API call order and behavior
operator/src/reference_values.rs
Update attestation key approval to use the new async status update helper.
  • Replace update_status macro usage with update_status function when approving AttestationKey resources
operator/src/attestation_key_register.rs

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The retry loop in compute-pcrs for handling 409 conflicts on the ConfigMap update has no backoff or max retry limit; consider adding a small delay and/or attempt cap to avoid a tight infinite loop if the conflict persists.
  • In update_status, the generic bounds on K are quite broad and Debug is not used; you may be able to simplify the signature (and potentially move the constraint to the Api<K> caller) to make the helper more generally reusable and easier to read.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The retry loop in `compute-pcrs` for handling 409 conflicts on the ConfigMap update has no backoff or max retry limit; consider adding a small delay and/or attempt cap to avoid a tight infinite loop if the conflict persists.
- In `update_status`, the generic bounds on `K` are quite broad and `Debug` is not used; you may be able to simplify the signature (and potentially move the constraint to the `Api<K>` caller) to make the helper more generally reusable and easier to read.

## Individual Comments

### Comment 1
<location path="operator/src/reference_values.rs" line_range="129-130" />
<code_context>
         return Ok(Action::requeue(Duration::from_secs(300)));
     }
+
+    let err = "Completed job had no owner reference";
+    let resource_name = &job.meta().owner_references.as_ref().context(err)?[0].name;
+    let approved_images: Api<ApprovedImage> = Api::default_namespaced(kube_client.clone());
+    let try_image = approved_images.get(resource_name).await;
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing the first owner reference by index can panic if the vector is empty.

`context(err)?` only guards the `Option`; it doesn’t prevent `owner_references` from being `Some(vec![])`, which will panic on `[0]`. Use something like `and_then(|v| v.first())` to ensure a non-empty vector while still surfacing the error via `Context`, e.g.:

```rust
let owner = job
    .meta()
    .owner_references
    .as_ref()
    .and_then(|v| v.first())
    .context(err)?;
let resource_name = &owner.name;
```
</issue_to_address>

### Comment 2
<location path="compute-pcrs/src/main.rs" line_range="66-75" />
<code_context>
+    // If we see this causing performance problems, consider NoSQL
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The tight retry loop on ConfigMap conflicts lacks backoff or a retry limit.

Under high contention this can devolve into a busy-wait and generate excessive API calls. Add a brief backoff (e.g., `tokio::time::sleep`) and/or a maximum retry count so that persistent conflicts fail with an error instead of looping indefinitely.

Suggested implementation:

```rust
    // If we see this causing performance problems, consider NoSQL
    const MAX_PCR_CONFIGMAP_RETRIES: u32 = 10;
    let mut pcr_configmap_retries: u32 = 0;

    loop {
        if pcr_configmap_retries >= MAX_PCR_CONFIGMAP_RETRIES {
            anyhow::bail!(
                "failed to update PCR ConfigMap after {MAX_PCR_CONFIGMAP_RETRIES} retries due to conflicts"
            );
        }

        // simple linear backoff to avoid busy-waiting under contention
        if pcr_configmap_retries > 0 {
            // scale by attempt to slightly increase delay on repeated conflicts
            sleep(Duration::from_millis(50 * pcr_configmap_retries as u64)).await;
        }

        pcr_configmap_retries += 1;

        let mut image_pcrs_map = config_maps.get(PCR_CONFIG_MAP).await?;

```

To compile and run correctly, you will also need to:

1. Import the backoff primitives near the top of `compute-pcrs/src/main.rs`:
   - `use std::time::Duration;`
   - `use tokio::time::sleep;`

2. Ensure that the loop breaks out on a successful ConfigMap update (e.g., after a successful `config_maps.replace(...)` call). The retry counter and backoff only help if the code:
   - `continue`s the loop specifically on conflict (HTTP 409) from the Kubernetes API, and
   - `break`s or returns on success or non-retryable errors.

If the conflict handling logic is not yet present, you should:
- Match on the `kube::Error` from the update call,
- `continue` the loop only when the error is a 409 conflict,
- Return the error immediately for all other failure modes.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread operator/src/reference_values.rs Outdated
Comment thread compute-pcrs/src/main.rs
@Jakob-Naucke Jakob-Naucke force-pushed the sep-compute-pcrs-concerns branch from ee4443f to 69932a0 Compare June 26, 2026 10:30
not from compute-pcrs job for separation of concerns.

Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Assisted-by: Opus 4.6
@Jakob-Naucke Jakob-Naucke force-pushed the sep-compute-pcrs-concerns branch from 69932a0 to d0441b1 Compare June 26, 2026 10:37
no longer needed, makes lib/ slimmer

Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Assisted-by: Opus 4.6
@Jakob-Naucke Jakob-Naucke force-pushed the sep-compute-pcrs-concerns branch from d0441b1 to 3409d7a Compare June 26, 2026 10:49
Comment thread compute-pcrs/src/main.rs
let status = ApprovedImageStatus { conditions };
update_status!(approved_images, &args.resource_name, status)?;
// If we see this causing performance problems, consider NoSQL
loop {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion from @alicefr: instead of "managing" concurrent writes from the job, write back into an extra status field of the ApprovedImage and let the operator collect

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

@Jakob-Naucke: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/operator-lifecycle-verify 3409d7a link true /test operator-lifecycle-verify

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant