sigstore: Cache verify_image results across policy evaluations#3136
sigstore: Cache verify_image results across policy evaluations#3136arewm wants to merge 17 commits into
Conversation
3950ea8 to
b0710ba
Compare
b0710ba to
d029230
Compare
Codecov Report✅ All modified and coverable lines are covered by tests.
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Review Summary by QodoCache verify_image results across policy evaluations
WalkthroughsDescription• Add caching layer for verify_image results using sync.Map and singleflight.Group • Cache key combines image reference and SHA256 hash of verification options • Prevents redundant signature verification across multiple policy evaluations • Add comprehensive test coverage for cache hit/miss scenarios Diagramflowchart LR
A["sigstoreVerifyImage called"] --> B{"Cache hit?"}
B -->|Yes| C["Return cached result"]
B -->|No| D["Use singleflight dedup"]
D --> E["doVerifyImage execution"]
E --> F["Store result in cache"]
F --> G["Return result"]
C --> H["Result"]
G --> H
File Changes1. internal/rego/sigstore/sigstore.go
|
Code Review by Qodo
1. Caches failed verifications
|
d029230 to
c504a19
Compare
| resultIface, err, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) { | ||
| return doVerifyImage(bctx, logger, uri, optsTerm) | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| result := resultIface.(*ast.Term) | ||
| verifyImageCache.Store(cacheKey, result) | ||
| return result, nil |
There was a problem hiding this comment.
1. Caches failed verifications 🐞 Bug ⛯ Reliability
verify_image stores results in the global cache unconditionally, including failure results (success=false) returned via signatureFailedResult. This can turn transient verification errors into persistent failures for the lifetime of the process, preventing retries in later policy evaluations/workers.
Agent Prompt
### Issue description
`ec.sigstore.verify_image` unconditionally caches the returned result term, including failure results. This can make transient verification failures persist across later policy evaluations within the same process.
### Issue Context
`doVerifyImage` converts errors into a normal `{success:false, errors:[...]}` term via `signatureFailedResult`, so the cache sees failures as valid values to store.
### Fix Focus Areas
- internal/rego/sigstore/sigstore.go[131-150]
- internal/rego/sigstore/sigstore.go[350-379]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
- Retries are exhausted before results reach the cache
- A genuine verification failure would cause policy failure regardless
- The misleading-error-pattern concern is real but rare, and the performance cost of not caching failures is a real tradeoff
I can update this caching behavior if desired to not cache failures. If we were to do this then a valid bad signature would result in a much slower evaluation.
With caching failures then we potentially get global signature failures for a transient issue on one task.
I'll need to Conforma team to help decide which option is more reasonable. I am leaning towards not caching failures though.
There was a problem hiding this comment.
Question:
- Retries are exhausted before results reach the cache 2. A genuine verification failure would cause policy failure regardless 3. The misleading-error-pattern concern is real but rare, and the performance cost of not caching failures is a real tradeoff I can update this caching behavior if desired to not cache failures. If we were to do this then a valid bad signature would result in a much slower evaluation. With caching failures then we potentially get global signature failures for a transient issue on one task. I'll need to Conforma team to help decide which option is more reasonable. I am leaning towards not caching failures though.
Answer:
Add an instruction to make the agent treat caching/performance/reliability findings as “tradeoff discussions” unless it can prove a concrete bug, and to explicitly discuss both sides (transient failures vs. cost of retries) before recommending a change.
Suggested config:
[pr_code_suggestions]
extra_instructions = """\
When suggesting changes around caching, retries, or global state:
- Treat them as tradeoffs unless you can demonstrate a concrete incorrect behavior from the diff.
- Always include a short section: "Why this might be a bug" vs "Why this might be intentional".
- Explicitly call out performance impact (extra verification calls), reliability impact (transient failures persisting), and scope (process-wide vs request/worker).
- If the “right” behavior depends on team policy, ask a clarifying question and propose 2-3 options rather than a single directive fix.
- Avoid strong wording like "will prevent retries" unless the code clearly shows retries are possible later and the cache blocks them.
"""If you also want fewer “debatable” suggestions overall, consider making suggestions more selective:
[pr_code_suggestions]
suggestions_depth = "selective"Relevant Sources:
- https://docs.qodo.ai/qodo-documentation/code-review/qodo-merge/tools/improve#extra-instructions-and-best-practices
- https://docs.qodo.ai/qodo-documentation/code-review/qodo-merge/tools/improve#controlling-suggestions-depth
- https://docs.qodo.ai/qodo-documentation/code-review/qodo-merge/usage-guide/additional_configurations#extra-instructions
|
Here's the performance stats I ran. We should try and look for improvements. If we can't, I think we'll have to send out communication on having to up memory requirements. Current PRPeak RSS (MB)
Wall Time (s)
main branchPeak RSS (MB)
Wall Time (s)
|
|
@joejstuart , thanks for that analysis. Is this increase just if we are adding the signature verification? So users would not need to increase memory requirements unless they used that functionality? If we were trying to keep ~200 components to finish evaluation in under 5 minutes, then we also fail that goal with the associated policy changes. |
I tested it with the corresponding policy change. Should I test without it? |
|
I think we needed the test with the policy change to understand the impact of that change. But it would be good to test it without as well to ensure that we don't have regression when that policy isn't being used. |
|
Without the policy change Peak RSS (MB)
Wall Time (s)
|
|
Thanks, @joejstuart . These are just single runs or averages over multiple runs? |
|
@joejstuart @simonbaird , would you be okay with this change along with conforma/policy#1680 ? |
|
Now I'm not seeing much of a performance difference between this and |
|
/ok-to-test |
c8a13b9 to
c97787f
Compare
|
Caution Review failedAn error occurred during the review process. Please try again later. 📝 WalkthroughWalkthroughThe PR adds verification result caching to ChangesSigstore verification caching
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Rebased to see if it gets closer to green. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/rego/sigstore/sigstore.go (1)
480-483: ⚡ Quick win
ClearCachesreassigning thesync.Mapvalue is racy.
verifyImageCacheis a package-level value (not a pointer); reassigning it while other goroutines may still be callingLoad/Storeon it is a data race. It happens to work in serial test runs, butgo testcan run tests from other packages or the same package in parallel, and the singleflight callback above also touches it from inner goroutines. Prefer clearing in place.-func ClearCaches() { - verifyImageCache = sync.Map{} -} +func ClearCaches() { + verifyImageCache.Range(func(k, _ any) bool { + verifyImageCache.Delete(k) + return true + }) +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/rego/sigstore/sigstore.go` around lines 480 - 483, ClearCaches currently reassigns the package-level sync.Map verifyImageCache which is racy; instead clear the map in place to avoid races: in ClearCaches iterate the existing verifyImageCache (use verifyImageCache.Range) and delete each key (verifyImageCache.Delete) so any concurrent Load/Store operations remain safe; update the ClearCaches function to perform in-place deletion rather than reassigning verifyImageCache.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/rego/sigstore/sigstore_test.go`:
- Around line 224-241: The newFakeContext helper misses a mock for
fake.FakeClient.HasBundles which doVerifyImage calls before
VerifyImageSignatures; update newFakeContext to add an expectation on c (the
*fake.FakeClient) for HasBundles (use mock.Anything for args) and return the
same boolean/error used in other tests (e.g., false, nil) so the HasBundles call
is explicitly verified before VerifyImageSignatures is exercised.
In `@internal/rego/sigstore/sigstore.go`:
- Around line 140-149: The current wrapper unconditionally stores the term
returned by doVerifyImage in verifyImageCache via verifyImageCache.Store,
causing transient verification failures to become permanently cached; change the
flow so only cacheable (deterministic/success) results are stored: have
doVerifyImage indicate cacheability (e.g., return a second bool or make the
ast.Term include a "success"/"cacheable" flag) and update the
verifyImageFlight.Do callback/result handling to inspect that flag and call
verifyImageCache.Store only when cacheable/successful; keep verifyImageFlight.Do
and verifyImageCache usage but skip storing non-cacheable failure terms so
transient errors are not persisted.
---
Nitpick comments:
In `@internal/rego/sigstore/sigstore.go`:
- Around line 480-483: ClearCaches currently reassigns the package-level
sync.Map verifyImageCache which is racy; instead clear the map in place to avoid
races: in ClearCaches iterate the existing verifyImageCache (use
verifyImageCache.Range) and delete each key (verifyImageCache.Delete) so any
concurrent Load/Store operations remain safe; update the ClearCaches function to
perform in-place deletion rather than reassigning verifyImageCache.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 2a574443-5227-40c2-9e2c-e9fffd56ad98
📒 Files selected for processing (2)
internal/rego/sigstore/sigstore.gointernal/rego/sigstore/sigstore_test.go
| // Use singleflight to deduplicate concurrent requests. doVerifyImage never returns a Go | ||
| // error — failures are embedded in the result term — so the error is intentionally discarded. | ||
| resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) { | ||
| return doVerifyImage(bctx, logger, uri, optsTerm) | ||
| }) | ||
|
|
||
| result := resultIface.(*ast.Term) | ||
| verifyImageCache.Store(cacheKey, result) | ||
| return result, nil | ||
| } |
There was a problem hiding this comment.
Transient verification failures are cached for the process lifetime.
doVerifyImage embeds all failures in the returned result term (per the file-level comment, the builtin never returns a Go error). The wrapper stores that result unconditionally in verifyImageCache, so any transient error — e.g. a registry/network failure surfacing from VerifyImageSignatures/VerifyImageAttestations, or a Rekor lookup hiccup — becomes a sticky “failed” verdict for ref|opts for the rest of the process. Subsequent components sharing the same pinned bundle will then deterministically report failure even though a fresh call would succeed. For long evaluations (the 191-component runs mentioned in the PR) this can convert a single flaky call into a wave of false negatives.
Consider only caching successful results, or distinguishing deterministic-input failures (new digest, opts parameter) from verification-call failures and not caching the latter.
🛡️ Sketch: only cache successes
Have doVerifyImage return (*ast.Term, bool /*cacheable*/, error) (or inspect result.Get("success")) and skip verifyImageCache.Store when the result represents a non-deterministic failure path.
- resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) {
- return doVerifyImage(bctx, logger, uri, optsTerm)
- })
-
- result := resultIface.(*ast.Term)
- verifyImageCache.Store(cacheKey, result)
- return result, nil
+ resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) {
+ return doVerifyImage(bctx, logger, uri, optsTerm)
+ })
+ result := resultIface.(*ast.Term)
+ if success, ok := result.Get(ast.StringTerm("success")).Value.(ast.Boolean); ok && bool(success) {
+ verifyImageCache.Store(cacheKey, result)
+ }
+ return result, nil📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Use singleflight to deduplicate concurrent requests. doVerifyImage never returns a Go | |
| // error — failures are embedded in the result term — so the error is intentionally discarded. | |
| resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) { | |
| return doVerifyImage(bctx, logger, uri, optsTerm) | |
| }) | |
| result := resultIface.(*ast.Term) | |
| verifyImageCache.Store(cacheKey, result) | |
| return result, nil | |
| } | |
| // Use singleflight to deduplicate concurrent requests. doVerifyImage never returns a Go | |
| // error — failures are embedded in the result term — so the error is intentionally discarded. | |
| resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) { | |
| return doVerifyImage(bctx, logger, uri, optsTerm) | |
| }) | |
| result := resultIface.(*ast.Term) | |
| if success, ok := result.Get(ast.StringTerm("success")).Value.(ast.Boolean); ok && bool(success) { | |
| verifyImageCache.Store(cacheKey, result) | |
| } | |
| return result, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/rego/sigstore/sigstore.go` around lines 140 - 149, The current
wrapper unconditionally stores the term returned by doVerifyImage in
verifyImageCache via verifyImageCache.Store, causing transient verification
failures to become permanently cached; change the flow so only cacheable
(deterministic/success) results are stored: have doVerifyImage indicate
cacheability (e.g., return a second bool or make the ast.Term include a
"success"/"cacheable" flag) and update the verifyImageFlight.Do callback/result
handling to inspect that flag and call verifyImageCache.Store only when
cacheable/successful; keep verifyImageFlight.Do and verifyImageCache usage but
skip storing non-cacheable failure terms so transient errors are not persisted.
|
/ok-to-test |
|
@st3penta , did you move this to draft because there is something conflicting or just because there hasn't been much review volume? Are there any concerns with merging this? |
|
hi @arewm, i moved it to draft because it seemed a bit stale, but talking with the team we agreed we should instead get it green and reviewed, and merge it |
|
|
||
| result := resultIface.(*ast.Term) | ||
| verifyImageCache.Store(cacheKey, result) | ||
| return result, nil |
There was a problem hiding this comment.
[high] logic-error
Error results (including transient failures) are cached permanently. When doVerifyImage encounters a transient error (network timeout, registry unavailability), it returns a failure *ast.Term with success: false. This result is unconditionally stored via verifyImageCache.Store(cacheKey, result). Subsequent calls with the same ref+opts return the cached failure without retrying. This contradicts the pattern in oci.go where .Store() is called only on the success path.
Suggested fix: Only cache successful results. Check the success field of the result term before storing, or restructure doVerifyImage to return nil on failure (matching the oci.go pattern) and only call verifyImageCache.Store when result is non-nil.
| } | ||
|
|
||
| // Use singleflight to deduplicate concurrent requests. doVerifyImage never returns a Go | ||
| // error — failures are embedded in the result term — so the error is intentionally discarded. |
There was a problem hiding this comment.
[medium] error-handling-gap
The singleflight result and error are both discarded. While doVerifyImage currently never returns a Go error, its signature is (*ast.Term, error). If later modified to return an error, resultIface will be nil and the type assertion will panic. The oci.go pattern handles this with a nil check.
Suggested fix: Add a nil check on resultIface before the type assertion, or change doVerifyImage to return only *ast.Term (no error).
| return cached.(*ast.Term), nil | ||
| } | ||
|
|
||
| // Use singleflight to deduplicate concurrent requests. doVerifyImage never returns a Go |
There was a problem hiding this comment.
[low] authorization-bypass-stale-cache
The verify_image cache has no TTL, eviction, or invalidation. Unlike OCI caches (content-addressed immutable data), signature verification results can change due to key revocation or certificate expiry. In a short-lived CLI process this is unlikely to be exploitable.
Suggested fix: Document the assumption that the cache is scoped to a single CLI invocation, or add a TTL.
| } | ||
| logger = logger.WithField("ref", string(uri)) | ||
|
|
||
| // Build cache key from ref + opts hash |
There was a problem hiding this comment.
[info] cache-key-missing-context
The cache key is constructed from ref and hash(opts) only, without including context identity (credentials). Consistent with oci.go pattern but worth documenting the single-invocation assumption.
The rpms-signature-scan task is in the main catalog now, so no need to use the non-standard catalog. Also update the digest.
Just noticed it needed fixing while creating previous commit.
I used the bash script like this:
curl -sL https://github.com/simonbaird/konflux-pipeline-patcher/raw/main/pipeline-patcher | bash -s bump-task-refs
Note that MintMaker has been disabled recently, so I think the usual
automated PRs are not available.
Commit created like this: hack/go-mod-upgrade-helper github.com/containerd/containerd --ignore-tidy-error
Needed for the next commit.
Commit created like this: hack/go-mod-upgrade-helper github.com/containerd/containerd/v2 --ignore-tidy-error
|
Might need a rebase on #3357 |
It has gone stale lately, but it was rebased/refreshed recent in EC-1866. Make sure we're using the refreshed version. The upstream version we're now using is based on v0.27.7. Ref: https://redhat.atlassian.net/browse/EC-1866 Co-authored-by: Claude Code <noreply@anthropic.com>
I always include the ubi bump commit in a larger chore PR, so making a fresh branch and pushing it is pretty much never what I want.
Fixes CI build failure where kubectl v0.34.2 expected k8s.io/api/scheduling/v1alpha1 which was removed in v0.36.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This helped me test the dependency change fixes in the previous commit.
This fixes some go mod tidy errors that we've been ignoring for a
while.
Older knative.dev/eventing versions import
knative.dev/pkg/{metrics,tracing/config} which have been removed
upstream. Fix by excluding v0.30.3 in tools and requiring minimum
v0.49.2 in acceptance (the first version that dropped the removed
imports).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…36.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
My hope is that this gets our acceptance tests, particularly the task_validate_image feature, passing again. See conforma/go-containerregistry#7
Add a sync.Map + singleflight.Group cache for ec.sigstore.verify_image results keyed by ref + opts hash. OPA's Memoize only deduplicates within a single Eval() call, but components are validated in separate Eval() calls. Since bundles are pinned by digest, verification results are stable and can be safely cached for the process lifetime. This prevents redundant signature verification when many components share the same task bundles (e.g., 100 components using git-clone). Ref: EC-1545 Assisted-by: Claude Code (Sonnet 4.6) Signed-off-by: arewm <arewm@users.noreply.github.com>
The truncated 8-byte SHA-256 is vulnerable to offline collision attacks since cache key inputs are observable across invocations. SHA-512 provides adequate post-quantum collision resistance for this security-sensitive cache key. Assisted-by: Claude Code (Sonnet 4.6) Signed-off-by: arewm <arewm@users.noreply.github.com>
Should fix the current unit test failure. Co-authored-by: Claude Code <noreply@anthropic.com>
e0a1cdb to
975e565
Compare
|
I rebased on all the commits from #3357 to see if it goes green. Would probably want to rebase again on updated main once 3357 is merged. |
|
🤖 Finished Review · ✅ Success · Started 7:59 PM UTC · Completed 8:12 PM UTC |
| resultIface, _, _ := verifyImageFlight.Do(cacheKey, func() (interface{}, error) { | ||
| return doVerifyImage(bctx, logger, uri, optsTerm) | ||
| }) | ||
|
|
There was a problem hiding this comment.
[high] logic-error
verifyImageCache.Store(cacheKey, result) is called outside the singleflight callback and unconditionally caches both success and failure results. Transient failures (network errors, DNS blips) are permanently cached for the process lifetime. The OCI caching pattern in internal/rego/oci/oci.go stores inside the singleflight callback and only on the success path.
Suggested fix: Move verifyImageCache.Store inside the singleflight callback and only store when result indicates success.
| @@ -210,6 +211,92 @@ func TestSigstoreVerifyImage(t *testing.T) { | |||
| } | |||
There was a problem hiding this comment.
[medium] test-inadequate
TestVerifyImageCache only covers the happy path. No test exists for when verification fails, which per the current implementation would be cached permanently.
Suggested fix: Add a test case where the mock returns an error on VerifyImageSignatures, then verify that a subsequent call retries rather than returning the cached failure.
| logger.Debug("using cached image signature verification result") | ||
| return cached.(*ast.Term), nil | ||
| } | ||
|
|
There was a problem hiding this comment.
[medium] fail-open
The process-level sync.Map cache stores signature verification results indefinitely with no TTL, no size bound, and no invalidation. ClearCaches() exists but is only called in test code. Stale cached success results would persist after key rotation or signature revocation.
Suggested fix: Ensure ClearCaches() is called between policy evaluation runs in production code, or scope the cache to a single evaluation run via context.
|
/ok-to-test |
Summary
sync.Map+singleflight.Groupcache forec.sigstore.verify_imageresults keyed byref|hash(opts)manifestCache/descriptorCacheininternal/rego/oci/oci.goTest plan
go build ./internal/rego/sigstore/compiles cleanlyRef: EC-1545
🤖 Generated with Claude Code