Problem Statement
Currently, Kubernetes scalability and end-to-end tests primarily focus on the "happy path" of execution. We lack systematic testing to validate and understand the side effects of complex execution paths that occur under failure conditions and overload.
As highlighted in Embracing Graceful Degradation in Kubernetes, we need to "Systematically Track and Test Historical Failure Modes" to better understand the complex feedback loops that drive the system under stress. A clear historical example is kubernetes/kubernetes#129795: when the ReplicaSet controller is overloaded, it fails to see its own writes and starts creating multiple pods per node, which amplifies the problem.
Continuing to rely solely on massive 5k-node scalability tests to understand how Kubernetes degrades at large scale is inefficient and misses these nuanced failure modes. We could follow the etcd path of developing our own custom failure injection tooling, like https://github.com/etcd-io/gofail, or we could use an existing state-of-the-art platform that is already available for free through CNCF.
Proposed Solution
We request that a CNCF Service Desk ticket be filed on behalf of the Kubernetes project to gain access to the Antithesis deterministic simulation testing platform.
Antithesis provides a unique testing environment that allows for deterministic execution of code, enabling the detection of subtle bugs. As announced during the KubeCon US 2025 opening remarks, Antithesis is partnering with CNCF to provide their product for free to all Graduated CNCF Projects (Announcement Link).
The etcd project has already successfully utilized Antithesis to perform deterministic fault injection (network faults, process pauses, etc.) to discover multiple elusive bugs without needing to build the infrastructure themselves (etcd robustness track record).
Proposed PoC for Kubernetes: We will use Antithesis fault injection to slow down watches and manipulate time in order to reproduce the ReplicaSet pod leaking issue. This will serve as our first step to test the efficacy of KEP-5647 (Delayed Reconciliation on Stale Watch) and to demonstrate the value of deterministic simulation for Kubernetes controllers.
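To make the feedback loop concrete, here is a minimal sketch of the failure mode the PoC targets. It is a toy model and not the actual ReplicaSet controller code: the `simulate` function, its parameters, and the round-based notion of watch lag are all assumptions made for illustration, and the model deliberately omits the safeguards the real controller has. It only shows how a controller that reconciles against a stale cached view can keep creating pods it has already created.

```go
package main

import "fmt"

// simulate runs a toy reconcile loop for the given number of rounds and
// returns how many pods exist in the authoritative store at the end.
// lag is how many rounds the controller's cached view trails the store.
// All names here are hypothetical; this is not Kubernetes controller code.
func simulate(desired, rounds, lag int) int {
	if lag < 0 {
		lag = 0
	}
	pods := 0                         // authoritative count (stand-in for the API server)
	history := make([]int, 0, rounds) // store count at the end of each round

	for r := 0; r < rounds; r++ {
		// The cache shows the store as it was lag rounds before the
		// previous round ended; until the first event arrives it is empty.
		cached := 0
		if idx := r - 1 - lag; idx >= 0 {
			cached = history[idx]
		}

		// Naive reconcile: create whatever appears to be missing
		// according to the (possibly stale) cache.
		if missing := desired - cached; missing > 0 {
			pods += missing
		}
		history = append(history, pods)
	}
	return pods
}

func main() {
	fmt.Println("fresh watch:", simulate(3, 5, 0))
	fmt.Println("stale watch:", simulate(3, 5, 2))
}
```

With a desired count of 3 over 5 rounds, the run without lag settles at 3 pods, while the run with a lag of 2 rounds ends with 9, because the controller reconciles three times before it observes its first write. A deterministic simulation platform would let us inject that kind of watch delay into the real controller and replay the exact schedule that triggers the leak.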
Cost
None. We do not need to ask for new funding or sponsorship. The platform is offered for free to graduated CNCF projects, and we can leverage the existing relationship and framework established via the etcd project's engagement with Antithesis.
Other Considerations, Notes, or References