
✨ Feat: Bundle Service serving OPA authorization policy bundles#415

Open
davidhadas wants to merge 1 commit into kagenti:main from davidhadas:bundle-service

Conversation

@davidhadas

@davidhadas davidhadas commented Jun 9, 2026


Introduce the bundle-service component, an HTTP server that assembles and serves OPA-compatible authorization bundles to AuthBridge sidecar clients. Each bundle is tailored to the requesting client's identity and built from three tiers of AuthorizationPolicy CRs: global, namespace, and client-scoped.

Key design decisions:

  • Decision logic lives in the global AuthorizationPolicy CR, not in code, giving platform engineers full control over tier combination semantics (override support, tier removal, AND/OR changes)
  • Bounded concurrency (semaphore + singleflight) protects the Kubernetes API server and etcd from thundering herd on cluster restart
  • Three-layer caching (ETag, bundle, policy) minimizes rebuild pressure
  • ETag-based conditional responses (304 Not Modified) reduce bandwidth
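The bounded-concurrency decision above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the PR's code: `builder`, `call`, `newBuilder`, and `Build` are hypothetical names, the hand-rolled in-flight map is a stand-in for a singleflight implementation such as `golang.org/x/sync/singleflight`, and the key format `team1/agent-a` is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight build shared by every caller that asks for the same key.
type call struct {
	done chan struct{}
	val  []byte
	err  error
}

// builder deduplicates concurrent builds per key and caps how many distinct
// builds run at once. All names here are hypothetical.
type builder struct {
	mu       sync.Mutex
	inflight map[string]*call
	sem      chan struct{} // buffered channel used as a counting semaphore
}

func newBuilder(maxConcurrent int) *builder {
	return &builder{
		inflight: make(map[string]*call),
		sem:      make(chan struct{}, maxConcurrent),
	}
}

// Build runs fn at most once per key at a time; followers wait for the
// leader's result instead of starting their own build.
func (b *builder) Build(key string, fn func() ([]byte, error)) ([]byte, error) {
	b.mu.Lock()
	if c, ok := b.inflight[key]; ok {
		b.mu.Unlock()
		<-c.done // follower: wait for the leader to finish
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	b.inflight[key] = c
	b.mu.Unlock()

	// Clean up and wake followers on every path, including panics in fn.
	defer func() {
		b.mu.Lock()
		delete(b.inflight, key)
		b.mu.Unlock()
		close(c.done)
	}()

	b.sem <- struct{}{}        // acquire a build slot
	defer func() { <-b.sem }() // release it on every path

	c.val, c.err = fn()
	return c.val, c.err
}

func main() {
	b := newBuilder(4)
	var builds int32
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.Build("team1/agent-a", func() ([]byte, error) {
				atomic.AddInt32(&builds, 1)
				time.Sleep(50 * time.Millisecond) // simulate an expensive rebuild
				return []byte("bundle"), nil
			})
		}()
	}
	wg.Wait()
	fmt.Printf("10 requests, %d build(s)\n", atomic.LoadInt32(&builds))
}
```

The deduplication bounds work per client, while the semaphore bounds the total number of simultaneous builds across clients, which is what limits pressure on the API server when many sidecars reconnect at once.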

Includes:

  • AuthorizationPolicy CRD types and conversion helpers
  • Bundle builder (tar.gz with OPA manifest)
  • Multi-tier policy cache with scope-aware invalidation
  • Kubernetes informer-based watcher with scope indexing
  • SPIFFE ID identity parsing
  • Deployment manifests, RBAC, NetworkPolicy, default global policy CR
  • Kind development script (hack/bundle-service-kind.sh)
  • Architecture and operational documentation
  • Comprehensive unit tests for all packages

Related issue(s)

#1789

@davidhadas davidhadas requested a review from a team as a code owner June 9, 2026 08:02
@davidhadas davidhadas changed the title Add bundle service for serving OPA authorization policy bundles Feat: Bundle Service serving OPA authorization policy bundles Jun 9, 2026
@davidhadas davidhadas force-pushed the bundle-service branch 2 times, most recently from 0f5cc82 to 7bec7d4 Compare June 9, 2026 09:06

@Alan-Cha Alan-Cha left a comment

Member

Summary

Strong implementation of the bundle-service component with comprehensive test coverage, good architecture documentation, and proper security considerations. The three-tier caching strategy and bounded concurrency design show careful attention to production readiness.

Areas reviewed: Go code (3400+ lines), Kubernetes manifests, CRDs, deployment configs, documentation, shell scripts, Dockerfile
Commits: 1 commit, signed-off: ✅ yes
CI status: 14/15 passing (E2E tests failing)


Required before merge

1. E2E Test Failure (must-fix)

The E2E Tests check is failing after 25m50s runtime. This needs investigation and resolution before merge. Check the test logs at: https://github.com/kagenti/kagenti-operator/actions/runs/27195746396/job/80286789432

2. PR Title Convention (must-fix)

Add emoji prefix per repo conventions. Current: Feat: Bundle Service...
Suggested: ✨ Feat: Bundle Service serving OPA authorization policy bundles


Suggestions (non-blocking)

Code Quality

  • cmd/bundle-service/main.go: Consider adding graceful shutdown handler to drain in-flight requests during pod restarts (prevents 502s)
  • internal/bundleservice/handler/handler.go: ETag validation logic could be extracted to a helper method for better testability
  • Consider adding observability metrics for cache hit/miss rates (singleflight pattern is excellent, metrics would make it visible)

Documentation

  • README.md: Add direct link to hack/bundle-service-kind.sh in the bundle service section
  • ARCHITECTURE.md: Excellent content - consider adding troubleshooting section for common operational issues
  • Consider documenting the ETag computation strategy in ARCHITECTURE.md (currently only in code comments)

Testing

  • Consider adding integration test coverage for multi-tier policy combination logic (unit tests are thorough, integration would validate end-to-end flow)

Security

  • NetworkPolicy: ✅ properly restricts ingress to port 8080
  • RBAC: ✅ appropriately scoped to read-only AuthorizationPolicies
  • Consider documenting threat model for bundle tampering (likely covered by SPIFFE mTLS but worth explicit mention)

What I particularly like

  1. Design decision to keep authorization logic in CRs - This gives platform engineers full control without code changes. Brilliant.
  2. Bounded concurrency with singleflight - Protects K8s API server from thundering herd. Production-ready thinking.
  3. Three-layer caching - ETag → bundle → policy tiers show deep understanding of performance optimization
  4. Comprehensive test coverage - Every package has tests, conversion helpers are well-tested
  5. Documentation quality - ARCHITECTURE.md explains not just what but why

Excellent work overall. Fix the E2E tests and title, and this is ready to merge.

@davidhadas davidhadas changed the title Feat: Bundle Service serving OPA authorization policy bundles ✨ Feat: Bundle Service serving OPA authorization policy bundles Jun 9, 2026
@davidhadas

Author

@Alan-Cha
Thank you for the review

  1. The failing E2E tests are not related to the bundle service. Not sure why they fail; maybe a pre-existing issue?
  2. Adding an emoji to the title causes the PR Verifier check to fail.

@mrsabath mrsabath left a comment

Contributor

Reviewed the full bundle-service. Really solid foundation — clean package separation, genuinely comprehensive tests, distroless/non-root image, least-privilege RBAC, and the concurrency model (per-client singleflight key, semaphore released on all paths via defer, RWMutex-guarded caches, clean informer lifecycle) all check out. Per-client bundle scoping is correct — no cross-tenant policy leakage — and the tar builder guards against path traversal.

On CI: E2E is red, but the only failing spec is Skill Discovery › should populate linkedSkills from annotation (e2e_test.go:2161) — unrelated to this PR (it touches no skill-discovery code), and main is green on the same suite. Looks like a flake; worth a re-run to get it green before merge rather than merging red.

A few non-blocking notes inline. The one I'd most want tracked: identity today comes from the client-controlled ?spiffe= query parameter with NoopVerifier, which ARCHITECTURE.md documents as intentional (verification deferred upstream, the Verifier interface as the seam, NetworkPolicy limiting ingress to AuthBridge sidecars). That's fine for the MVP trust boundary, but the mTLS/JWT verifier should land before this is reachable beyond the sidecar mesh — might be worth a tracking issue.

Holding my own approval until E2E is green; @Alan-Cha has already approved.

Areas reviewed: Go (identity/handler/provider/cache/watcher/bundle), security/authz, concurrency, RBAC/NetworkPolicy, Dockerfile, CI · Commits: 1, signed-off: yes · CI: E2E failing (unrelated flake — Skill Discovery)


// NoopVerifier performs no verification. Used when the service does not yet
// terminate TLS or validate tokens (verification is handled upstream).
type NoopVerifier struct{}

Contributor

Tracking note (not a blocker): identity is parsed from the client-controlled ?spiffe= query param and NoopVerifier performs no checks, so on its own the service trusts whatever a caller claims. ARCHITECTURE.md documents this as intentional — verification is deferred upstream, the NetworkPolicy restricts ingress to kagenti.dev/authbridge-labeled pods, and this Verifier interface is the seam for the real check. That's a reasonable MVP trust boundary. The ask is just to land an mTLS-SAN or JWT verifier here before the service is reachable beyond the AuthBridge sidecar mesh, since at that point the ?spiffe= param becomes a cross-client bundle-fetch vector. Worth a tracking issue linked off #1789.

# Allow access to Kubernetes API server
- to:
  - ipBlock:
      cidr: 0.0.0.0/0

Contributor

suggestion: egress is 0.0.0.0/0 on 443/6443. The DNS rule above is nicely scoped; for the API-server rule, consider narrowing to the API server's ClusterIP / service CIDR where the environment makes that practical, so the policy actually constrains egress. Fine to leave for a follow-up if the CIDR isn't stable across target clusters.


RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -o bundle-service ./cmd/bundle-service/

FROM gcr.io/distroless/static:nonroot

Contributor

nit: gcr.io/distroless/static:nonroot isn't pinned by digest. Pinning to ...@sha256:<digest> makes the build reproducible and satisfies Scorecard's pinned-dependencies check. (Consistent with how the chart-RBAC/Scorecard items have been handled elsewhere — fine to defer.)

ns := u.GetNamespace()
slog.Info("namespace policy changed", "namespace", ns)
w.policyCache.InvalidateNamespace(ns)
w.etagCache.InvalidateAll()

Contributor

nit (perf, not correctness): on a namespace-tier policy change this calls etagCache.InvalidateAll() / bundleCache.InvalidateAll(), which also evicts clients in other namespaces. Bundles stay correct (the policyCache.InvalidateNamespace(ns) above is properly scoped), so this is just extra cache churn / 200s where a 304 would do. An InvalidateFunc keyed on strings.HasPrefix(agentID, ns+"/") would scope it. Low priority.

@davidhadas

Author

@mrsabath -
Created kagenti/kagenti#1860
A second run of the E2E tests still failed: FAIL! -- 35 Passed | 1 Failed | 0 Pending | 5 Skipped
That is better than before, so it looks like an instability issue. I will try again.

@pdettori

Member

@davidhadas CI has been fixed - please rebase from upstream main and force push so we can check now with full CI

Introduce the bundle-service component, an HTTP server that assembles
and serves OPA-compatible authorization bundles to AuthBridge sidecar
clients. Each bundle is tailored to the requesting client's identity
and built from three tiers of AuthorizationPolicy CRs: global,
namespace, and client-scoped.

Key design decisions:
- Decision logic lives in the global AuthorizationPolicy CR, not in
  code, giving platform engineers full control over tier combination
  semantics (override support, tier removal, AND/OR changes)
- Bounded concurrency (semaphore + singleflight) protects the
  Kubernetes API server and etcd from thundering herd on cluster restart
- Three-layer caching (ETag, bundle, policy) minimizes rebuild pressure
- ETag-based conditional responses (304 Not Modified) reduce bandwidth

Includes:
- AuthorizationPolicy CRD types and conversion helpers
- Bundle builder (tar.gz with OPA manifest)
- Multi-tier policy cache with scope-aware invalidation
- Kubernetes informer-based watcher with scope indexing
- SPIFFE ID identity parsing
- Deployment manifests, RBAC, NetworkPolicy, default global policy CR
- Kind development script (hack/bundle-service-kind.sh)
- Architecture and operational documentation
- Comprehensive unit tests for all packages

Signed-off-by: David Hadas <david.hadas@gmail.com>
@davidhadas

davidhadas commented Jun 11, 2026

Author

What should we use as transport: Bundle Service (HTTP) or ConfigMap distribution?

The current bundle service design, implemented in this PR, uses the standard OPA bundle protocol (HTTP polling) to distribute policies to AuthBridge sidecars. @rhuss proposed that, since the operator controls the sidecar pod spec, the bundle service would push rendered policies via ConfigMaps instead, eliminating the HTTP layer entirely.
Propagation delay and consistency properties both seem similar. The 1 MiB ConfigMap limit is unlikely to be a practical constraint for Rego policies.

Main arguments for the alternative design - ConfigMap transport (the K8s native way)

  • Eliminates the HTTP serving layer: handler, ETag/304, singleflight, three-tier cache, concurrency control — infrastructure that exists to serve HTTP polling efficiently. ConfigMap writes don't need any of it.
  • Allows us to use K8s-native mechanisms and fold the functionality into the operator, which already watches AgentRuntime CRDs and controls the sidecar in other ways. This removes a separate Deployment, Service, ServiceAccount, RBAC, and NetworkPolicy.

Arguments for the current design - HTTP transport (the OPA native way)

  • The OPA SDK embedded in the sidecar supports HTTP natively. The ConfigMap approach requires designing custom filesystem loading for the embedded OPA: custom code replacing community-maintained code.
  • HTTP is the standard OPA bundle transport mechanism. It keeps the system open: customers can point AuthBridge at any web server to obtain bundles.
  • An HTTP bundle service we design can serve bundles to Authorino and potentially future OPA PDPs that become part of the agent platform. This gives us a uniform, consolidated place to build and distribute access control rules across the agent platform.

@rhuss

rhuss commented Jun 11, 2026

Contributor

Thanks for the summary David. A few points from the thread that I think should be reflected for a balanced picture:

Missing from the ConfigMap section:

  • Code reduction: the HTTP serving layer (handler, ETag/304, singleflight, concurrency control, SPIFFE parsing, three-layer cache) is ~2,500 lines that go away entirely. Total custom code drops from ~4,000 to ~1,500 lines.
  • AuthBridge already hot-reloads config from ConfigMaps (confirmed by the AuthBridge team in the thread). This is a proven mechanism, not a hypothetical.
  • No availability dependency: ConfigMaps are in etcd (already required). The bundle service must be running and reachable for every poll cycle.

On "custom code replacing community-maintained code":
This framing is one-sided. The bundle service itself is 4,000 lines of custom code (HTTP handler, caching, singleflight, identity parsing, bundle compilation). The ConfigMap approach reduces total custom code. The client-side adaptation (filesystem loader or rego.New with file watching) replaces community HTTP client code, yes, but the net custom code is significantly less.

On "can serve bundles to Authorino and potentially future OPA PDPs":
This was discussed at length in the thread. Today: Authorino doesn't use OPA bundles, OpenShell doesn't fetch bundles, Keycloak has no OPA support. Building for adoption that hasn't materialized is premature. Additionally, the bundle service is a separate deployment that ships as part of the agent platform. Any external component that depends on it (like Authorino) would require the agent platform to be installed, even when the use case has nothing to do with agents. If this service is meant to be cross-product, it should live as an independent project with its own lifecycle and release process. That packaging perspective should be part of the tradeoff discussion.

Suggested addition to the summary:
A third section covering the operational and packaging dimension would make this complete: deployment footprint, ownership boundaries, and whether the agent platform repo is the right home for a distribution service that aspires to serve components beyond AuthBridge.

@davidhadas

davidhadas commented Jun 11, 2026

Author

@rhuss Thank you for the additions. To clarify — the summary above was generated by Claude with the goal of being neutral. I asked it to focus on points that hadn't already been addressed in our earlier discussion, keeping a concise view of the alternatives.

I think we may be covering the same ground again — I believe most of these points came up previously and were discussed on Slack.

On Authorino specifically: I don't see a dependency between Authorino and the agent platform. What I do see is an opportunity for a centralized authorization service when Authorino is deployed as part of the agent platform — a point I covered in our earlier Slack thread.

It's also worth noting that even if upstream starts with HTTP transport to explore the benefits it may bring, that doesn't commit downstream to adopting it — this isn't a one-way door. We can continue to learn and adapt the transport in upstream as we go, based on what we observe in practice.

Happy to take this to a quick sync if you feel the written thread isn't converging — sometimes a short call is more productive.

@davidhadas

davidhadas commented Jun 12, 2026

Author

Here is a diagram to help highlight the deployment options under HTTP and the opportunity for openness it offers.

image


5 participants