Feature/agent platform #529 (Draft)

allamand wants to merge 8 commits into main from feature/agent-platform

Conversation

@allamand
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Add repository overview with architecture and tech stack
- Add GitOps patterns and ArgoCD best practices
- Add Backstage development guidelines with Kro plugin details
- Add Kro (Kubernetes Resource Orchestrator) development guide
- Add Terraform infrastructure guidelines
- Add application development standards for all languages
- Add coding standards and best practices
- Add ML/AI workloads guide (Ray, Kubeflow, MLflow, etc.)
- Add progressive delivery guide (Argo Rollouts, canary deployments)
- Add comprehensive troubleshooting guide

These steering files provide AI agents with deep context about:
- Repository structure and conventions
- Development workflows and patterns
- Testing strategies
- Security best practices
- Common troubleshooting scenarios

- Add DESIGN.md with complete architecture and implementation plan
- Add README.md user guide for deployment and usage
- Add COMPONENTS.md with detailed component specifications
- Add TROUBLESHOOTING.md with diagnostic procedures

Documentation covers:
- GitOps bridge pattern for agent platform integration
- Feature flag mechanism for backward compatibility
- Component details (Kagent, LiteLLM, Langfuse, Jaeger, Tofu Controller, Agent Core)
- Security, monitoring, backup/DR considerations
- Migration guide and troubleshooting procedures

Changes:
- Replace ApplicationSet pattern with individual ArgoCD Applications
- Each component (Kagent, LiteLLM, Agent Gateway, Langfuse, Jaeger, Tofu Controller, Agent Core) is now a separate Application
- Each Application directly references its Helm chart in sample-agent-platform-on-eks repository
- Update architecture diagrams to reflect new pattern
- Update verification commands and troubleshooting steps
- Simplify deployment flow documentation

This approach provides:
- More explicit control over each component
- Easier debugging and management
- Direct chart references without ApplicationSet generator complexity
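As a sketch of the new pattern, one per-component ArgoCD Application might look like the following (the repo URL, chart path, and namespace are placeholders for illustration, not the actual values in the PR):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: litellm
  namespace: argocd
spec:
  project: default
  source:
    # Direct chart reference — no ApplicationSet generator in between
    repoURL: https://github.com/example/sample-agent-platform-on-eks
    targetRevision: main
    path: charts/litellm
  destination:
    server: https://kubernetes.default.svc
    namespace: litellm
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Each component (Kagent, LiteLLM, Agent Gateway, etc.) would get its own copy of this manifest, which is what makes per-component debugging and sync control more explicit.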
…ular architecture

- Expanded Epic 8 from 4 generic tasks to 10 detailed Asana-ready tasks
  covering bridge chart, IAM roles, secrets, hub-config, and e2e validation
- Updated DESIGN.md to remove workshop_type references from appmod-blueprints
  (workshop concerns move to platform-engineering-on-eks repo)
- Made Agent Gateway auth provider-agnostic (Keycloak/Cognito/external)
- Parameterized resource prefix throughout (no hardcoded peeks)
- Updated deployment scenarios, FAQ, migration guide for modular architecture
- Added cross-references between UPGRADE-APPROACH.md and DESIGN.md
- Updated task total to 60, timeline to 14-18 weeks
- Added 3 new success criteria for agent platform on modular architecture
- UPGRADE-APPROACH.md: Fixed duplicate Epic 8 issue, appended clean
  Epic 8 detailed breakdown (Tasks 8.1-8.10) with Kro+ACK compositions
- DESIGN.md: Replaced all terraform apply/init/variables.tf references
  with hub-config addons approach and GitOps bootstrap patterns.
  Updated deployment flows, testing, migration guide, FAQ, and DR sections.
  Feature flag now uses hub-config ConfigMap (Level 3) and GitOps commit
  (Level 4) instead of Terraform variables and deploy scripts.
- README.md: Updated Quick Start to use kubectl bootstrap instead of
  terraform apply. Updated Disable section. Parameterized resource prefix.
… patterns/workshop, git credentials, schema versioning, observability config

## CI/CD Integration

### GitLab CI Pipeline
Contributor Author

Do you want to use GitLab for CI/CD?

Contributor Author

For Backstage, we were wondering if we should scope this out of the project and just use the one in CNOE, using existing OSS plugins for Kro, GitLab integration…


## ApplicationSet Patterns

### List Generator Pattern
Contributor Author

We should include the cluster generator here, as it is the one we use most in the project.
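A cluster generator ApplicationSet, as suggested above, might look like this sketch (the repo URL, label selector, and addon path are assumptions for illustration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-addons
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            fleet_member: spoke   # only clusters registered with this label
  template:
    metadata:
      name: 'addons-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/appmod-blueprints
        targetRevision: main
        path: gitops/addons
      destination:
        server: '{{server}}'      # filled from the cluster secret
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

The generator enumerates ArgoCD cluster secrets, so newly registered spoke clusters are picked up automatically — which is why it fits the fleet auto-discovery flow discussed later in this PR.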

- Integration tests with AWS
- Deployment validation

## Backstage Integration
Contributor Author

I disagree on this. Backstage really should not create things in Kubernetes directly, but only through GitOps: PR, push.

Contributor

This can be fixed.


### Cluster Design Principles
1. **Multi-AZ Deployment**: Spread across 3 availability zones
2. **Managed Node Groups**: Use EKS managed node groups
Contributor Author

We don't want to use managed node groups but EKS Auto Mode.

Contributor

This can be fixed.


### IRSA Configuration
Contributor Author

Remove this


#### ModelConfig CRD

Contributor Author

Can we add another example pointing to a Ray endpoint in the cluster instead of Bedrock?

Contributor

This can be fixed.
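Picking up the suggestion above, a ModelConfig pointing at an in-cluster Ray Serve endpoint might look like the following sketch. The service name, namespace, model id, and even the exact CRD field names are assumptions and should be checked against the kagent CRD reference; the premise is only that Ray Serve's LLM serving exposes an OpenAI-compatible route:

```yaml
# Hypothetical sketch — field names and endpoint are assumptions
apiVersion: kagent.dev/v1alpha1
kind: ModelConfig
metadata:
  name: ray-hosted-model
  namespace: kagent
spec:
  provider: OpenAI        # talk to Ray Serve's OpenAI-compatible API
  model: my-llm           # assumed model id registered with Ray Serve
  openAI:
    baseUrl: http://ray-serve-svc.ray-system.svc.cluster.local:8000/v1
```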

```yaml
model: anthropic.claude-3-5-sonnet-20241022-v2:0
region: us-east-1
```

# Service account with IRSA
Contributor Author

Can we use Pod Identity instead of IRSA?

Contributor

This can be fixed.
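For reference, switching from IRSA to EKS Pod Identity roughly means replacing the role-ARN annotation on the service account with a pod identity association. The cluster name, namespace, service account, and role ARN below are placeholders:

```shell
# Requires the eks-pod-identity-agent addon on the cluster.
aws eks create-pod-identity-association \
  --cluster-name hub \
  --namespace kagent \
  --service-account kagent-sa \
  --role-arn arn:aws:iam::111122223333:role/kagent-bedrock-role
# The service account itself no longer needs the
# eks.amazonaws.com/role-arn annotation used by IRSA.
```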


### Overview

Agent Core Components provision AWS Bedrock Agent Core capabilities (Memory, Browser, Code Interpreter) using Tofu Controller.
Contributor Author

Can we deploy this using Kro/ACK instead of OpenTofu/Terraform, to have consistency with the platform?

Contributor

There is no ACK support for Agent Core components, so this is the only approach we did a POC for.


> **Note**: There is no Terraform in the `appmod-blueprints` solution repo. Initial EKS cluster creation (via Terraform, CDK, eksctl, etc.) lives in the customer's own infra repo or the workshop repo (`platform-engineering-on-eks`). Once the hub cluster exists, it self-manages via Kro+ACK/CrossPlane compositions and ArgoCD.

### Changes in `sample-agent-platform-on-eks` Repository
Contributor Author

What is the need to create another git repo for that? We should only use appmod-blueprints, also for agents.

Contributor

It is absolutely required for extending the platform to the agent platform work we are doing separately. That's the core reason for this refactor. This is our core tenet.

Comment thread docs/UPGRADE-APPROACH.md Outdated
3. **Config-external**: `hub-config.yaml` lives outside the repo; customers pass their own config. The config drives Kro/CrossPlane compositions and ArgoCD bootstrap.
4. **Provider-agnostic**: Git provider (GitHub vs GitLab vs CodeCommit), OIDC provider, and CI/CD provider are swappable via configuration
5. **GitOpsy spokes**: Spoke clusters provisioned and managed via CrossPlane/Kro through the hub cluster — same mechanism as hub self-management
6. **Workshop as a pattern, not a fork**: Workshop-specific code lives in `patterns/workshop/` within the main repo (alongside other consumption patterns like `patterns/hub-only/`, `patterns/full-platform/`). The workshop pattern includes CloudFront, GitLab integration, Identity Center setup, and workshop-specific configurations. Heavy workshop orchestration (Terraform for cluster creation, deploy scripts) lives in the internal `platform-engineering-on-eks` GitLab repo.
Contributor Author

Platform-specific code will be in the pattern repo. There is no Terraform code to put in the internal platform-engineering-on-eks; we just reuse the generic code to create the hub. Specific scripts will also live in the workshop pattern scripts dir, as they only apply to how we deploy the platform, and can be referenced by users wanting to use other patterns as well.

Comment thread docs/UPGRADE-APPROACH.md
│ └── README.md # Workshop deployment guide (references platform-engineering-on-eks)
├── applications/ # UNCHANGED: Sample apps
├── backstage/ # UNCHANGED: Backstage IDP
├── gitops/ # REFACTORED: GitOps configurations
Contributor Author

What is refactored here? Looks UNCHANGED to me.

Comment thread docs/UPGRADE-APPROACH.md Outdated

> **Key changes**:
> - The `platform/infra/terraform/` directory is removed from `appmod-blueprints`. All Terraform code for cluster creation, GitLab PATs, and workshop-specific infra moves to the `platform-engineering-on-eks` internal GitLab repo. The solution repo is purely GitOps-native for ongoing management.
Contributor Author

I think we need to keep this Terraform to create the hub cluster in this repo; it is optional. The workshop and full patterns will use it, while other patterns may use existing clusters. There is no point or advantage in moving this to GitLab.

Contributor

Terraform module to create hub cluster stays in this repo.

Comment thread docs/UPGRADE-APPROACH.md Outdated
> **Key changes**:
> - The `platform/infra/terraform/` directory is removed from `appmod-blueprints`. All Terraform code for cluster creation, GitLab PATs, and workshop-specific infra moves to the `platform-engineering-on-eks` internal GitLab repo. The solution repo is purely GitOps-native for ongoing management.
> - `modules/hub-provisioning/` provides a turnkey Terraform module that customers can `source` from GitHub to provision the hub cluster and bootstrap the platform. After bootstrap, the platform is self-managing.
> - `examples/` is renamed to `patterns/` to better reflect that these are consumption patterns, not just examples. The `workshop/` pattern is a first-class citizen alongside other patterns. Workshop-specific configuration (CloudFront, GitLab, Identity Center) lives in `patterns/workshop/`; heavy workshop orchestration (Terraform, deploy scripts) lives in `platform-engineering-on-eks`.
Contributor Author

There is no examples/ folder in the current setup; what are you referring to?

Collaborator

> There is no examples/ folder in the current setup; what are you referring to?

@allamand the proposal is to have a folder called patterns or blueprints — see the discussion on the Slack channel. Under patterns we can have a workshop folder that will contain workshop-specific content. This addresses the fact that we cannot move all workshop-specific content to the GitLab repo for workshop content.

Comment thread docs/UPGRADE-APPROACH.md Outdated
> **Key changes**:
> - The `platform/infra/terraform/` directory is removed from `appmod-blueprints`. All Terraform code for cluster creation, GitLab PATs, and workshop-specific infra moves to the `platform-engineering-on-eks` internal GitLab repo. The solution repo is purely GitOps-native for ongoing management.
> - `modules/hub-provisioning/` provides a turnkey Terraform module that customers can `source` from GitHub to provision the hub cluster and bootstrap the platform. After bootstrap, the platform is self-managing.
> - `examples/` is renamed to `patterns/` to better reflect that these are consumption patterns, not just examples. The `workshop/` pattern is a first-class citizen alongside other patterns. Workshop-specific configuration (CloudFront, GitLab, Identity Center) lives in `patterns/workshop/`; heavy workshop orchestration (Terraform, deploy scripts) lives in `platform-engineering-on-eks`.
Contributor Author

All workshop-specific patterns should be there, so users can see the full picture of how we did it; don't hide things in the internal repo.

Comment thread docs/UPGRADE-APPROACH.md
eksctl create cluster --name hub --region us-west-2

# 2. Install ArgoCD
helm install argocd argo/argo-cd -n argocd --create-namespace
Contributor Author

Activate EKS capabilities.
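For example, enabling EKS Auto Mode capabilities on an existing cluster might look like the following; the cluster name, region, and the exact flag payloads are assumptions and should be verified against the current AWS CLI reference:

```shell
# Enable Auto Mode compute, block storage, and load balancing
# (payload shapes are assumed for illustration)
aws eks update-cluster-config --name hub --region us-west-2 \
  --compute-config '{"enabled":true,"nodePools":["general-purpose","system"]}' \
  --storage-config '{"blockStorage":{"enabled":true}}' \
  --kubernetes-network-config '{"elasticLoadBalancing":{"enabled":true}}'
```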

Comment thread docs/UPGRADE-APPROACH.md Outdated

### 3.4 Phase 4: Workshop Isolation

**Goal**: Move all workshop-specific code (including ALL Terraform) to the internal `platform-engineering-on-eks` GitLab repo. The `appmod-blueprints` repo becomes a clean, customer-facing, GitOps-native solution with zero Terraform and zero workshop concerns.
Contributor Author

Again, just in the patterns/workshop/ folder, not the internal GitLab.

Comment thread docs/UPGRADE-APPROACH.md Outdated

#### 3.5.2 Target State

- Hub cluster creation is done once by any tool (eksctl, CDK, TF, CLI) — this is outside `appmod-blueprints`
Contributor Author

But we provide a few options there for people who don't have existing clusters.

Comment thread docs/UPGRADE-APPROACH.md
- CloudFront via ACK CloudFront controller (optional)
- Observability via ACK Grafana/Prometheus (optional)
- Pod Identity via native K8s resources
- Spoke clusters are provisioned exclusively via Kro RGDs or CrossPlane compositions from the hub
Contributor Author

I would say this does not matter; users will have the choice to provision them using any tool. We provide a way to do it with Kro/ACK, but that could also be Crossplane, Pulumi, Terraform, eksctl, the console... It does not matter. We just need to show them how to create/register the cluster secret and which IAM role to add in the EKS access entries; then the platform will register it automatically with Argo and bootstrap it as a fleet member.

Comment thread docs/UPGRADE-APPROACH.md Outdated
- Pod Identity via native K8s resources
- Spoke clusters are provisioned exclusively via Kro RGDs or CrossPlane compositions from the hub
- ArgoCD ApplicationSets auto-discover and bootstrap new spokes
- Backstage templates allow self-service spoke creation
Contributor Author

I would also like to add here an agentic way to add clusters, using our solution that uses agents.

So users can either:

  • use Backstage
  • use an agent
  • use native GitOps integration

to create new spoke clusters, or any apps.

@elamaran11
Contributor

elamaran11 commented Mar 13, 2026

@allamand Thanks for the feedback. We will be implementing some of these, but the rest is not part of the tenets we decided upon for this approach. I'm happy to have you as part of this refactor effort. The following feedback is incorporated:

EKS Auto Mode — replaced "node groups" with "EKS Auto Mode (no managed node groups)" throughout; added to design principles, executive summary, hub-config example, cluster stack description, hub-provisioning module, bootstrap guide, and test steps.

Pod Identity, not IRSA — added as design principle #10, added pod_identity: true to hub-config example with explicit "not IRSA" comments. IRSA was not previously referenced in this doc, so the additions make the Pod Identity preference explicit.

Backstage GitOps-only — updated all Backstage template references to emphasize PR/push through GitOps, not direct kubectl apply. Updated Task 3.6, Task 5.3, the Asana tables, and the spoke creation flow.

Cluster generator for ApplicationSets — added "cluster generator pattern" to fleet descriptions in both repo structures, the ApplicationSets change table, and the spoke auto-discovery section.

No examples/ folder — fixed the incorrect "renamed from examples/" references to clarify patterns/ is a new directory.

Activate EKS capabilities — updated the customer bootstrap flow to include aws eks update-cluster-config for enabling Auto Mode capabilities (compute, networking, storage, load balancing) after cluster creation.

elamaran11 marked this pull request as draft on March 13, 2026 at 15:00
elamaran11 and others added 2 commits March 13, 2026 11:10
…tOps-only, cluster generator, activate EKS capabilities, fix examples/ reference
…workshop/

Relocate TF/scripts destination from platform-engineering-on-eks internal
GitLab repo to patterns/workshop/terraform/ and patterns/workshop/scripts/
within appmod-blueprints. The internal repo contains only workshop content
and instructions, not infrastructure code.

Key changes:
- Task 1.5: 'Move TF to external repo' → 'Relocate to patterns/workshop/'
- Task 2.3: simplified dependencies (no longer depends on Epic 4)
- Epic 4: 'Move to external repo' → 'Reorganize within appmod-blueprints'
- Tasks 4.1-4.7: rewritten for internal reorganization
- Target repo structure: added patterns/workshop/terraform/ and scripts/
- All impact tables, risk mitigations, execution order updated
- DESIGN.md: 4 references updated to patterns/workshop/
- References to platform-engineering-on-eks: 87→10 (all content/instructions role)

3 participants