Request access to Antithesis failure injection platform for Kubernetes

### Problem Statement

Currently, Kubernetes scalability and end-to-end tests primarily focus on the "happy path" of execution. We lack systematic testing to validate and understand the side effects of complex execution paths that occur under failure conditions and overload.

As highlighted in [Embracing Graceful Degradation in Kubernetes](https://docs.google.com/document/d/1raierreVsdjEnG-9ksWJa-OM_QwJOVS4fE_dvvs3894/edit?usp=sharing&resourcekey=0-8U3sqoWAKk809Wt83kfROw), we need to "Systematically Track and Test Historical Failure Modes" to better understand the complex feedback loops that drive the system under stress. A good historical example is https://github.com/kubernetes/kubernetes/issues/129795: when the ReplicaSet controller is overloaded, it fails to see its own writes and starts creating multiple pods per node, which amplifies the problem.

Continuing to rely solely on massive 5k node scalability tests to understand how Kubernetes degrades at large scale is inefficient and misses these nuanced failure modes. We could follow the etcd path of developing our own custom failure injection tooling, like https://github.com/etcd-io/gofail, or we could use an existing state-of-the-art platform that is already available for free through CNCF.

### Proposed Solution

We request that a CNCF Service Desk ticket be filed on behalf of the Kubernetes project to gain access to the Antithesis deterministic simulation testing platform.

Antithesis provides a unique testing environment that allows for deterministic execution of code, enabling the detection of subtle bugs. As announced during the KubeCon US 2025 opening remarks, Antithesis is partnering with CNCF to provide their product for free to all Graduated CNCF Projects ([Announcement Link](https://www.youtube.com/watch?v=cQvtT2vRhok&t=34m30s)).

The etcd project has already successfully utilized Antithesis to perform deterministic fault injection (network faults, process pauses, etc.) to discover multiple elusive bugs without needing to build the infrastructure themselves ([etcd robustness track record](https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record)).

**Proposed PoC for Kubernetes:** We will use Antithesis fault injection to slow down watches and manipulate time to successfully reproduce the ReplicaSet pod leaking issue. This will serve as our first step to test the efficacy of [KEP-5647 (Delayed Reconciliation on Stale Watch)](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/5647-stale-controller-handling/README.md) and prove the value of deterministic simulation for Kubernetes controllers.

### Cost

None. We do not need to ask for new funding or sponsorship. The platform is offered for free to graduated CNCF projects, and we can leverage the existing relationship and framework established via the etcd project's engagement with Antithesis.

### Other Considerations, Notes, or References
- KubeCon EU 2026 Talk: [Keeping the Cloud Afloat with Deterministic Simulation Testing - Marcus Hodgson, Antithesis & Marek Siarkowicz, Google](https://kccnceu2026.sched.com/event/2CW6k/keeping-the-cloud-afloat-with-deterministic-simulation-testing-marcus-hodgson-antithesis-marek-siarkowicz-google)
- KEP-5647 (Stale controller handling): https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/5647-stale-controller-handling/README.md
- Antithesis GitHub Action: https://github.com/etcd-io/etcd/blob/main/.github/workflows/antithesis-test.yml
- Integration code with etcd: https://github.com/etcd-io/etcd/tree/main/tests/antithesis


