test: add E2E tests for payload processor with Kind cluster CI#31

Open
asaadbalum wants to merge 1 commit into llm-d:main from asaadbalum:feat/issue-14-add-e2e-tests

@asaadbalum
Contributor

@asaadbalum asaadbalum commented May 4, 2026

Summary

Add end-to-end tests that deploy a complete Envoy + Payload Processor + model-server-simulator stack on a Kind cluster and validate core functionality through the actual ext_proc gRPC pipeline.

  • Base model routing: Verifies that model field extraction from /v1/chat/completions and /v1/completions bodies routes Llama and DeepSeek requests to the correct pools via X-Gateway-Base-Model-Name header.
  • LoRA adapter routing: Verifies that adapter names are resolved to base models through ConfigMap reconciliation and routed to the correct pool.
  • Streaming: Verifies that streaming requests ("stream": true) return SSE text/event-stream chunks through the full Envoy → Payload Processor → model-server pipeline.
  • Metrics: Verifies that ipp_info and ipp_success_total Prometheus metrics are populated after traffic flows.
  • CI integration: E2E job added to ci-pr-checks.yaml, skipping docs-only changes. Removed unused python-lint and container-build jobs.

Manifest structure

Kubernetes manifests live under deploy/ following the llm-d-router pattern: shared components (deploy/components/) and environment-specific infrastructure (deploy/environments/dev/e2e-infra/). Each component directory includes a kustomization.yaml. Test code references these manifests via relative paths with ${VAR} substitution, enabling reuse for both E2E tests and local Kind development.
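
One way the `${VAR}` substitution could work is with the standard library's `os.Expand`. The sketch below is an assumption about the mechanism, not the suite's actual helper: `renderManifest` is a hypothetical name, and the values given for `E2E_NS` and `E2E_IMAGE` are placeholders.

```go
package main

import (
	"fmt"
	"os"
)

// renderManifest replaces ${VAR} and $VAR references in a manifest with
// values from vars. Unknown variables expand to the empty string, so a real
// harness would probably want to validate its inputs first.
func renderManifest(manifest string, vars map[string]string) string {
	return os.Expand(manifest, func(name string) string {
		return vars[name]
	})
}

func main() {
	manifest := "metadata:\n  namespace: ${E2E_NS}\nspec:\n  image: ${E2E_IMAGE}\n"

	// Placeholder values; the real ones would come from the test configuration.
	fmt.Print(renderManifest(manifest, map[string]string{
		"E2E_NS":    "e2e-test",
		"E2E_IMAGE": "example.test/ipp:dev",
	}))
}
```

Passing an explicit map rather than reading the process environment keeps the rendering deterministic, which may matter when the same manifests serve both tests and local development.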

The E2E Envoy configuration mirrors production:

New files

| File | Purpose |
| --- | --- |
| `deploy/components/ipp/deployment.yaml` | Payload Processor Deployment (parameterized) |
| `deploy/components/ipp/service.yaml` | Payload Processor Service |
| `deploy/components/ipp/rbac.yaml` | ServiceAccount + ClusterRole + ClusterRoleBinding |
| `deploy/components/ipp/kustomization.yaml` | Kustomize resource list for IPP component |
| `deploy/components/model-server/llama/deployment.yaml` | Llama simulator + Service + adapter ConfigMap |
| `deploy/components/model-server/llama/kustomization.yaml` | Kustomize resource list |
| `deploy/components/model-server/deepseek/deployment.yaml` | DeepSeek simulator + Service + adapter ConfigMap |
| `deploy/components/model-server/deepseek/kustomization.yaml` | Kustomize resource list |
| `deploy/environments/dev/e2e-infra/envoy.yaml` | Envoy proxy with ext_proc filter config |
| `deploy/environments/dev/e2e-infra/client.yaml` | Curl client pod for in-cluster requests |
| `test/e2e/e2e_suite_test.go` | Ginkgo BeforeSuite/AfterSuite: deploys stack, waits for readiness |
| `test/e2e/e2e_test.go` | 7 test cases covering all scenarios above |
| `test/e2e/README.md` | Developer quickstart guide for running E2E locally |
| `test/e2e/TROUBLESHOOTING.md` | Troubleshooting guide for common issues |
| `hack/test-e2e.sh` | Shell script orchestrating Kind + image build + test run |

Modified files

| File | Change |
| --- | --- |
| `.github/workflows/ci-pr-checks.yaml` | Added e2e job, removed python-lint and container-build |
| `Makefile` | Added test-e2e, image-build-local, image-kind targets |

Test plan

  • All 7 E2E tests pass on Kind (7 Passed | 0 Failed)
  • All existing unit/integration tests pass (go test ./...)
  • Lint clean (go vet + build tags)
  • Manual verification with kubectl exec curl for each scenario
  • CI workflow validated locally (make image-kind && make test-e2e)
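
The streaming and metrics checks in this plan amount to inspecting two text formats. The Go sketch below shows roughly what such assertions might look at. The helper names are hypothetical, the `[DONE]` sentinel is assumed from OpenAI-style streaming rather than confirmed for the simulator, and the sample label sets are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countSSEData counts "data:" lines in an SSE response body, skipping empty
// payloads and the "[DONE]" sentinel (assumed, see the note above).
func countSSEData(body string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data:") {
			continue
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload != "" && payload != "[DONE]" {
			n++
		}
	}
	return n
}

// hasMetric reports whether a Prometheus text exposition contains at least
// one sample line for the named metric. Comment lines (# HELP, # TYPE) are
// ignored.
func hasMetric(exposition, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue
		}
		// A sample line is "<name>{labels} value" or "<name> value".
		if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
			return true
		}
	}
	return false
}

func main() {
	// Invented sample payloads standing in for real curl output.
	sse := "data: {\"choices\":[]}\n\ndata: {\"choices\":[]}\n\ndata: [DONE]\n\n"
	metrics := "# TYPE ipp_success_total counter\n" +
		"ipp_success_total{model=\"x\"} 3\n" +
		"ipp_info{version=\"dev\"} 1\n"

	fmt.Println("SSE chunks:", countSSEData(sse))
	fmt.Println("ipp_info present:", hasMetric(metrics, "ipp_info"))
	fmt.Println("ipp_success_total present:", hasMetric(metrics, "ipp_success_total"))
}
```

Matching on the metric name followed by `{` or a space avoids counting a longer metric that merely shares the prefix.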

Closes #14

@github-actions

github-actions Bot commented May 4, 2026

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@asaadbalum asaadbalum force-pushed the feat/issue-14-add-e2e-tests branch 3 times, most recently from 9da84d3 to 1b2b395 Compare May 4, 2026 11:54

@aradhalevy aradhalevy left a comment

Looks good, some minor comments (and we will need the newly added check to run and pass first)

Comment thread test/testdata/e2e-deployment.yaml Outdated
  name: llama-adapters
  namespace: $E2E_NS
  labels:
    inference.llm-d.io/ipp-managed: "true"

I believe it is supposed to be llm-d.ai instead of llm-d.io, as per #28.

After this fix the tests pass for me locally.

Comment thread test/testdata/e2e-deployment.yaml Outdated
  name: deepseek-adapters
  namespace: $E2E_NS
  labels:
    inference.llm-d.io/ipp-managed: "true"

same llm-d.ai instead of llm-d.io

Comment thread test/e2e/README.md
| Streaming routing | SSE chunks returned |
| Metrics | `bbr_info`, `bbr_success_total` |

## Troubleshooting

When I had a couple of other clusters set up in kind, Envoy tried to route requests to them. Please add a troubleshooting suggestion to run kind delete clusters --all first to clean up your kind environment.

Contributor Author

Added

containers:
- name: payload-processor
  image: $E2E_IMAGE
  imagePullPolicy: Never

This should be IfNotPresent if we want to test on a cluster other than kind. But that requires pushing an image to ghcr.io and might require some more changes, so it can be dealt with in another issue / PR if you prefer to keep this PR for kind only.

Contributor Author

Keeping it for now, will address it in a follow-up PR

Comment thread .github/workflows/ci-e2e.yaml Outdated

- name: Run E2E tests
  run: |
    E2E_IMAGE=ghcr.io/llm-d/llm-d-inference-payload-processor:e2e \

You don't use the Makefile / script here. I think it would be better to use them so there is a single source of truth.

@asaadbalum asaadbalum force-pushed the feat/issue-14-add-e2e-tests branch from 1b2b395 to 9de9955 Compare May 6, 2026 07:57
@github-actions github-actions Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 6, 2026
@github-actions

github-actions Bot commented May 6, 2026

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

@aradhalevy aradhalevy left a comment

LGTM.
I think this is fine even though it is a large PR, as all the code is needed and relevant to this minimal e2e test suite.

Comment thread test/e2e/e2e_suite_test.go Outdated

var (
	testConfig *testutils.TestConfig
	ppImage    string
Collaborator

nit: it would be better to align the name on ipp rather than pp (that was the agreed acronym).

Comment thread test/e2e/README.md Outdated
| Base model routing | Pool routing via header |
| LoRA adapter routing | ConfigMap adapter lookup |
| Streaming routing | SSE chunks returned |
| Metrics | `bbr_info`, `bbr_success_total` |
Collaborator

as a follow up, we should update all metrics to be named ipp instead of bbr.

not a blocker

Comment thread test/e2e/README.md
| Streaming routing | SSE chunks returned |
| Metrics | `bbr_info`, `bbr_success_total` |

## Troubleshooting
Collaborator

this should probably go to a separate troubleshooting guide.
quickstart guide should be quick, and simple :)
in other words, the simplest explanation of the green path.

Comment thread test/testdata/e2e-deployment.yaml Outdated
@@ -0,0 +1,164 @@
# Llama model server simulator
apiVersion: apps/v1
Collaborator

can you explain the separation between e2e-deployment and deepseek-model-server?
I see deepseek has deployment + svc.
here I see deployment + svc for a llama plus adapter of deepseek + llama + many other CRs.
not sure I understand the separation.

Comment thread .github/workflows/ci-e2e.yaml Outdated
- '!**/*.md'
- '!LICENSE'
- '!OWNERS'

Collaborator

can you move this logic to the file "ci-pr-checks.yaml" (and, while you're at it, clean out the python lint and build jobs at the end of it)?

@asaadbalum asaadbalum force-pushed the feat/issue-14-add-e2e-tests branch from 9de9955 to 3c561d1 Compare May 11, 2026 07:13
@github-actions

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

@asaadbalum asaadbalum requested a review from nirrozenbaum May 11, 2026 07:17
@nirrozenbaum
Collaborator

@shmuelk can you please review this PR when you have time?
would be good to validate that the test structure is aligned with the scheduler.

@shmuelk
Collaborator

shmuelk commented May 13, 2026

@nirrozenbaum I took a very quick look at this PR.

I don't like its structure. This E2E test looks a lot more like the old IGW E2E test and not like the scheduler's E2E test.

@roytman restructured the End to End test and the development environment on Kind to use the same K8S YAML and config YAML files where possible. Following that idea here will make it easier to put together a development environment on Kind.

@nirrozenbaum
Collaborator

> @nirrozenbaum I took a very quick look at this PR.
>
> I don't like its structure. This E2E test looks a lot more like the old IGW E2E test and not like the scheduler's E2E test.
>
> @roytman restructured the End to End test and the development environment on Kind to use the same K8S YAML and config YAML files where possible. Following that idea here will make it easier to put together a development environment on Kind.

@asaadbalum can you please take a look at @shmuelk's feedback and work towards making the e2e tests work like they do in llm-d scheduler (or, under its new name, llm-d router)?

Adds end-to-end tests that deploy a complete stack on a Kind cluster:
Envoy proxy (v1.33, FULL_DUPLEX_STREAMED ext_proc), Payload Processor,
Llama and DeepSeek model-server simulators, and adapter ConfigMaps.

Kubernetes manifests live under deploy/ following the llm-d-router
pattern: shared components (deploy/components/) and environment-specific
infrastructure (deploy/environments/dev/e2e-infra/). Test code references
these manifests via relative paths with ${VAR} substitution.

Tests cover base-model routing, LoRA adapter resolution, streaming
requests, and ipp_* metrics exposure.

Signed-off-by: Asaad Balum <asaad.balum@gmail.com>
@asaadbalum asaadbalum force-pushed the feat/issue-14-add-e2e-tests branch from 3c561d1 to 2a0668c Compare May 17, 2026 07:12
@github-actions

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

@nirrozenbaum
Collaborator

cc for another pair of eyes: @noyitz

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

add e2e tests

4 participants