✨ Feat: Bundle Service serving OPA authorization policy bundles#415
✨ Feat: Bundle Service serving OPA authorization policy bundles#415davidhadas wants to merge 1 commit into
Conversation
0f5cc82 to
7bec7d4
Compare
Alan-Cha
left a comment
There was a problem hiding this comment.
Summary
Strong implementation of the bundle-service component with comprehensive test coverage, good architecture documentation, and proper security considerations. The three-tier caching strategy and bounded concurrency design show careful attention to production readiness.
Areas reviewed: Go code (3400+ lines), Kubernetes manifests, CRDs, deployment configs, documentation, shell scripts, Dockerfile
Commits: 1 commit, signed-off: ✅ yes
CI status: 14/15 passing (E2E tests failing)
Required before merge
1. E2E Test Failure (must-fix)
The E2E Tests check is failing after 25m50s runtime. This needs investigation and resolution before merge. Check the test logs at: https://github.com/kagenti/kagenti-operator/actions/runs/27195746396/job/80286789432
2. PR Title Convention (must-fix)
Add emoji prefix per repo conventions. Current: Feat: Bundle Service...
Suggested: ✨ Feat: Bundle Service serving OPA authorization policy bundles
Suggestions (non-blocking)
Code Quality
cmd/bundle-service/main.go: Consider adding graceful shutdown handler to drain in-flight requests during pod restarts (prevents 502s)internal/bundleservice/handler/handler.go: ETag validation logic could be extracted to a helper method for better testability- Consider adding observability metrics for cache hit/miss rates (singleflight pattern is excellent, metrics would make it visible)
Documentation
README.md: Add direct link tohack/bundle-service-kind.shin the bundle service sectionARCHITECTURE.md: Excellent content - consider adding troubleshooting section for common operational issues- Consider documenting the ETag computation strategy in ARCHITECTURE.md (currently only in code comments)
Testing
- Consider adding integration test coverage for multi-tier policy combination logic (unit tests are thorough, integration would validate end-to-end flow)
Security
- NetworkPolicy: ✅ properly restricts ingress to port 8080
- RBAC: ✅ appropriately scoped to read-only AuthorizationPolicies
- Consider documenting threat model for bundle tampering (likely covered by SPIFFE mTLS but worth explicit mention)
What I particularly like
- Design decision to keep authorization logic in CRs - This gives platform engineers full control without code changes. Brilliant.
- Bounded concurrency with singleflight - Protects K8s API server from thundering herd. Production-ready thinking.
- Three-layer caching - ETag → bundle → policy tiers show deep understanding of performance optimization
- Comprehensive test coverage - Every package has tests, conversion helpers are well-tested
- Documentation quality - ARCHITECTURE.md explains not just what but why
Excellent work overall. Fix the E2E tests and title, and this is ready to merge.
|
@Alan-Cha
|
mrsabath
left a comment
There was a problem hiding this comment.
Reviewed the full bundle-service. Really solid foundation — clean package separation, genuinely comprehensive tests, distroless/non-root image, least-privilege RBAC, and the concurrency model (per-client singleflight key, semaphore released on all paths via defer, RWMutex-guarded caches, clean informer lifecycle) all check out. Per-client bundle scoping is correct — no cross-tenant policy leakage — and the tar builder guards against path traversal.
On CI: E2E is red, but the only failing spec is Skill Discovery › should populate linkedSkills from annotation (e2e_test.go:2161) — unrelated to this PR (it touches no skill-discovery code), and main is green on the same suite. Looks like a flake; worth a re-run to get it green before merge rather than merging red.
A few non-blocking notes inline. The one I'd most want tracked: identity today comes from the client-controlled ?spiffe= query parameter with NoopVerifier, which ARCHITECTURE.md documents as intentional (verification deferred upstream, the Verifier interface as the seam, NetworkPolicy limiting ingress to AuthBridge sidecars). That's fine for the MVP trust boundary, but the mTLS/JWT verifier should land before this is reachable beyond the sidecar mesh — might be worth a tracking issue.
Holding my own approval until E2E is green; @Alan-Cha has already approved.
Areas reviewed: Go (identity/handler/provider/cache/watcher/bundle), security/authz, concurrency, RBAC/NetworkPolicy, Dockerfile, CI · Commits: 1, signed-off: yes · CI: E2E failing (unrelated flake — Skill Discovery)
|
|
||
| // NoopVerifier performs no verification. Used when the service does not yet | ||
| // terminate TLS or validate tokens (verification is handled upstream). | ||
| type NoopVerifier struct{} |
There was a problem hiding this comment.
Tracking note (not a blocker): identity is parsed from the client-controlled ?spiffe= query param and NoopVerifier performs no checks, so on its own the service trusts whatever a caller claims. ARCHITECTURE.md documents this as intentional — verification is deferred upstream, the NetworkPolicy restricts ingress to kagenti.dev/authbridge-labeled pods, and this Verifier interface is the seam for the real check. That's a reasonable MVP trust boundary. The ask is just to land an mTLS-SAN or JWT verifier here before the service is reachable beyond the AuthBridge sidecar mesh, since at that point the ?spiffe= param becomes a cross-client bundle-fetch vector. Worth a tracking issue linked off #1789.
| # Allow access to Kubernetes API server | ||
| - to: | ||
| - ipBlock: | ||
| cidr: 0.0.0.0/0 |
There was a problem hiding this comment.
suggestion: egress is 0.0.0.0/0 on 443/6443. The DNS rule above is nicely scoped; for the API-server rule, consider narrowing to the API server's ClusterIP / service CIDR where the environment makes that practical, so the policy actually constrains egress. Fine to leave for a follow-up if the CIDR isn't stable across target clusters.
|
|
||
| RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -o bundle-service ./cmd/bundle-service/ | ||
|
|
||
| FROM gcr.io/distroless/static:nonroot |
There was a problem hiding this comment.
nit: gcr.io/distroless/static:nonroot isn't pinned by digest. Pinning to ...@sha256:<digest> makes the build reproducible and satisfies Scorecard's pinned-dependencies check. (Consistent with how the chart-RBAC/Scorecard items have been handled elsewhere — fine to defer.)
| ns := u.GetNamespace() | ||
| slog.Info("namespace policy changed", "namespace", ns) | ||
| w.policyCache.InvalidateNamespace(ns) | ||
| w.etagCache.InvalidateAll() |
There was a problem hiding this comment.
nit (perf, not correctness): on a namespace-tier policy change this calls etagCache.InvalidateAll() / bundleCache.InvalidateAll(), which also evicts clients in other namespaces. Bundles stay correct (the policyCache.InvalidateNamespace(ns) above is properly scoped), so this is just extra cache churn / 200s where a 304 would do. An InvalidateFunc keyed on strings.HasPrefix(agentID, ns+"/") would scope it. Low priority.
|
@mrsabath - |
|
@davidhadas CI has been fixed - please rebase from upstream main and force push so we can check now with full CI |
Introduce the bundle-service component, an HTTP server that assembles and serves OPA-compatible authorization bundles to AuthBridge sidecar clients. Each bundle is tailored to the requesting client's identity and built from three tiers of AuthorizationPolicy CRs: global, namespace, and client-scoped. Key design decisions: - Decision logic lives in the global AuthorizationPolicy CR, not in code, giving platform engineers full control over tier combination semantics (override support, tier removal, AND/OR changes) - Bounded concurrency (semaphore + singleflight) protects the Kubernetes API server and etcd from thundering herd on cluster restart - Three-layer caching (ETag, bundle, policy) minimizes rebuild pressure - ETag-based conditional responses (304 Not Modified) reduce bandwidth Includes: - AuthorizationPolicy CRD types and conversion helpers - Bundle builder (tar.gz with OPA manifest) - Multi-tier policy cache with scope-aware invalidation - Kubernetes informer-based watcher with scope indexing - SPIFFE ID identity parsing - Deployment manifests, RBAC, NetworkPolicy, default global policy CR - Kind development script (hack/bundle-service-kind.sh) - Architecture and operational documentation - Comprehensive unit tests for all packages Signed-off-by: David Hadas <david.hadas@gmail.com>
7bec7d4 to
e2db9e1
Compare
|
What should we use as transport? Bundle Service (HTTP) or ConfigMap Distribution The bundle service current design implemented in this PR uses the standard OPA bundle protocol (HTTP polling) to distribute policies to AuthBridge sidecars. @rhuss proposed that the operator controls the sidecar pod spec, the bundle service will push rendered policies via ConfigMaps instead, eliminating the HTTP layer entirely. Main arguments for the alternative design - ConfigMap transport (the K8s native way)
Arguments for the current design - HTTP transport (the OPA native way)
|
|
Thanks for the summary David. A few points from the thread that I think should be reflected for a balanced picture: Missing from the ConfigMap section:
On "custom code replacing community-maintained code": On "can serve bundles to Authorino and potentially future OPA PDPs": Suggested addition to the summary: |
|
@rhuss Thank you for the additions. To clarify — the summary above was generated by Claude with the goal of being neutral. I asked it to focus on points that hadn't already been addressed in our earlier discussion, keeping a concise view of the alternatives. I think we may be covering the same ground again — I believe most of these points came up previously and were discussed on Slack. On Authorino specifically: I don't see a dependency between Authorino and the agent platform. What I do see is an opportunity for a centralized authorization service when Authorino is deployed as part of the agent platform — a point I covered in our earlier Slack thread. It's also worth noting that even if upstream starts with HTTP transport to explore the benefits it may bring, that doesn't commit downstream to adopting it — this isn't a one-way door. We can continue to learn and adapt the transport in upstream as we go, based on what we observe in practice. Happy to take this to a quick sync if you feel the written thread isn't converging — sometimes a short call is more productive. |

Introduce the bundle-service component, an HTTP server that assembles and serves OPA-compatible authorization bundles to AuthBridge sidecar clients. Each bundle is tailored to the requesting client's identity and built from three tiers of AuthorizationPolicy CRs: global, namespace, and client-scoped.
Key design decisions:
Includes:
Related issue(s)
#1789