Closed
149 commits
dee5519
Merge branch 'Azure:master' into master
kaarthis Jul 17, 2025
30ef3ea
control plane enhancement blog
kevintom0927 Mar 28, 2026
afadcb1
control plane enhancement blog
kevintom0927 Mar 31, 2026
25d1b42
typo
kevintom0927 Mar 31, 2026
eb72dbd
adding trucate section
kevintom0927 Mar 31, 2026
0e346b5
description
kevintom0927 Apr 1, 2026
f7c40bf
Add blog on dranet
anson627 Apr 1, 2026
c3326c2
add system diagram for control plane and data plane
anson627 Apr 2, 2026
cb06cdf
update blog to remove placement group and reference system diagrams
anson627 Apr 2, 2026
6e300bd
consolidate getting started and next steps
anson627 Apr 2, 2026
2ed6e8a
update intro section
anson627 Apr 2, 2026
1979650
add benchmark results chart
anson627 Apr 2, 2026
5731b9a
remove benchmark results table
anson627 Apr 2, 2026
1d2975d
polish intro
anson627 Apr 2, 2026
06f142b
adding review changes
kevintom0927 Apr 3, 2026
39afedc
adding review changes 2
kevintom0927 Apr 3, 2026
381506d
nit
kevintom0927 Apr 3, 2026
f0ed0b1
nits
kevintom0927 Apr 3, 2026
c86a140
Apply suggestions from code review
kevinkrp93 Apr 3, 2026
e5df833
Apply suggestions from code review
kevinkrp93 Apr 3, 2026
7efd08a
add conclisuion section
kevintom0927 Apr 3, 2026
fd93570
copilot suggestion
kevintom0927 Apr 3, 2026
dfa1dda
nit in etcd
kevintom0927 Apr 3, 2026
acc84c0
nit in intro
kevintom0927 Apr 3, 2026
409cc9f
nit in author
kevintom0927 Apr 3, 2026
183b9a0
add authors
anson627 Apr 3, 2026
ff109d6
fix lint
anson627 Apr 3, 2026
96218ab
Apply suggestions from code review
anson627 Apr 3, 2026
14ab76f
update benchmark
anson627 Apr 3, 2026
1e3f415
add first draft of agentgateway appnet blog
therealmitchconnors Apr 3, 2026
1bfcc93
Merge branch 'Azure:master' into lates-ccp-enhancements
kevinkrp93 Apr 4, 2026
9382f9c
hero image and last suggestions
kevintom0927 Apr 4, 2026
0ec0385
hero image size
kevintom0927 Apr 4, 2026
2e9689b
co-pilot suggestions
kevintom0927 Apr 4, 2026
fd00054
nit
kevintom0927 Apr 4, 2026
7a57591
fix links
kevintom0927 Apr 4, 2026
ce4aac4
docs(changelog): add details for Kubernetes version 1.35 general avai…
Apr 7, 2026
dfcc7e6
Update CHANGELOG.md
kaarthis Apr 7, 2026
6be88d1
Update CHANGELOG.md
kaarthis Apr 7, 2026
7077bb8
Revise AKS Kubernetes version support details
kaarthis Apr 8, 2026
f1986c5
Blog: AI Inference on AKS Arc - Part 1 - 4 (#5679)
drajpure Apr 8, 2026
08f6cf4
update problem section
anson627 Apr 8, 2026
2c16d95
polish problem section
anson627 Apr 8, 2026
2dcc8b0
Merge pull request #5703 from kaarthis/2026-03-05
kaarthis Apr 8, 2026
adfdb89
Create blog post for agent skills for AKS (#5683)
julia-yin Apr 8, 2026
b7dc6da
Release Notes - 2026-04-02 (#5707)
shashankbarsin Apr 9, 2026
971574f
Merge branch 'master' into lates-ccp-enhancements
kevinkrp93 Apr 9, 2026
5981d95
comments
kevintom0927 Apr 9, 2026
3a52701
refine AI inference series (Parts 1-4) and add Part 5 (TensorRT-LLM)
Apr 9, 2026
0ebed38
Merge pull request #5708 from drajpure/blog/ai-inference-refineadd
drajpure Apr 10, 2026
7ce0aae
adding apf
kevintom0927 Apr 10, 2026
2815c85
apply feedback, add links
therealmitchconnors Apr 10, 2026
82a3ae9
Merge branch 'master' into agw-blog
therealmitchconnors Apr 10, 2026
98d5da8
update solution section
anson627 Apr 11, 2026
fed9856
Apply suggestions from code review
therealmitchconnors Apr 13, 2026
2d24f17
apply feedback
therealmitchconnors Apr 13, 2026
4378195
apply copilot feedback
therealmitchconnors Apr 13, 2026
1ed1f1c
formatting fixes
therealmitchconnors Apr 13, 2026
3c7e83e
add author zhewei
therealmitchconnors Apr 13, 2026
d2e1080
Merge branch 'master' into add-dranet-blog
anson627 Apr 13, 2026
ed3cbe2
Update website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks…
anson627 Apr 13, 2026
9fca347
fix merge conflict
anson627 Apr 13, 2026
475a1d2
Update benchmark
anson627 Apr 13, 2026
276f058
update author
anson627 Apr 13, 2026
bd14ac8
fix typo
anson627 Apr 13, 2026
ad27bf7
fix build
anson627 Apr 13, 2026
e559e90
Apply suggestion from @bmoore-msft
bmoore-msft Apr 13, 2026
8a094df
Merge pull request #5718 from kevinkrp93/edit-apf-rlnote
kevinkrp93 Apr 13, 2026
1a2f6af
Update website/blog/authors.yml
therealmitchconnors Apr 14, 2026
99825e5
fix image links
therealmitchconnors Apr 14, 2026
eb971f3
Updated hero image
sabbour Apr 15, 2026
26d3e79
address feedback
anson627 Apr 15, 2026
d5c317f
address feedback
anson627 Apr 15, 2026
acc6994
Update website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks…
anson627 Apr 15, 2026
d91c8ff
Update website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks…
anson627 Apr 15, 2026
a5d95c3
address feedback
anson627 Apr 15, 2026
aca7652
fix typo
anson627 Apr 15, 2026
e877e21
add hero image
anson627 Apr 15, 2026
099a6ed
update diagram
anson627 Apr 15, 2026
7a54d7b
Update website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks…
anson627 Apr 15, 2026
8d16456
Update website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks…
anson627 Apr 15, 2026
9d779af
address comment
anson627 Apr 15, 2026
9181333
Merge pull request #5697 from Azure/add-dranet-blog
anson627 Apr 15, 2026
f926d16
Merge branch 'Azure:master' into lates-ccp-enhancements
kevinkrp93 Apr 15, 2026
f24b753
finale
kevintom0927 Apr 15, 2026
fa00c36
spelling
kevintom0927 Apr 15, 2026
db86489
nit linter fix
kevintom0927 Apr 16, 2026
0ad3ccb
Apply Collin's suggestions
therealmitchconnors Apr 16, 2026
1bb6b4b
combine solution sections
therealmitchconnors Apr 16, 2026
e664ae8
Merge branch 'master' into agw-blog
therealmitchconnors Apr 16, 2026
dcbb6c8
blog: argo cd extension with microsoft entra sso and terraform
pauldotyu Apr 16, 2026
d04a936
Apply suggestions from code review
therealmitchconnors Apr 16, 2026
4b8f72f
Update website/blog/2026-04-17-argocd-extension-with-microsoft-entra/…
pauldotyu Apr 16, 2026
fb1cd1d
Update website/blog/2026-04-17-argocd-extension-with-microsoft-entra/…
pauldotyu Apr 16, 2026
964828c
Update website/blog/tags.yml
pauldotyu Apr 16, 2026
98bec42
Update website/blog/2026-04-17-argocd-extension-with-microsoft-entra/…
pauldotyu Apr 16, 2026
82f4cdb
apply feedback
therealmitchconnors Apr 17, 2026
1875b91
add support for mermaid diagrams and rename appnet blog
pauldotyu Apr 17, 2026
d498783
attempt to fix lint
therealmitchconnors Apr 17, 2026
12f7a5d
fix line length
therealmitchconnors Apr 17, 2026
ca09871
bump date
pauldotyu Apr 17, 2026
0bc98a1
more lint failures
therealmitchconnors Apr 17, 2026
81404de
Merge pull request #5698 from therealmitchconnors/agw-blog
chzbrgr71 Apr 20, 2026
caf7e5e
bump date and add mention flux
pauldotyu Apr 20, 2026
b6fd104
merge conflicts
kevinkrp93 Apr 20, 2026
db88b15
Merge branch 'Azure:master' into lates-ccp-enhancements
kevinkrp93 Apr 20, 2026
beeee20
Clean up authors.yml by removing duplicates
kevinkrp93 Apr 20, 2026
46731fc
Merge pull request #5689 from kevinkrp93/lates-ccp-enhancements
kevinkrp93 Apr 20, 2026
4253e01
Addessing PR feedback
pauldotyu Apr 20, 2026
516e890
docs(changelog): remove Cilium GW api entry
sf-msft Apr 20, 2026
7ea0115
Merge pull request #5732 from sf-msft/remove-cilium-gw
sf-msft Apr 21, 2026
eb310cb
bump date
pauldotyu Apr 22, 2026
534924d
Refine introduction and overview
pauldotyu Apr 22, 2026
fcc084a
Note this will also work with Automatic clusters
pauldotyu Apr 23, 2026
21309aa
Merge pull request #5727 from pauldotyu/blog/argocd_sso
sanketbakshi1981 Apr 24, 2026
6d4e061
Update manageAssignees.yml configuration
Vyshnavi-MSFT Apr 24, 2026
44ff7d8
Fix link for AKS troubleshooting sub-skills
PixelRobots Apr 28, 2026
356fe34
Merge pull request #5744 from Azure/PixelRobots-patch-2
sanketbakshi1981 Apr 29, 2026
52c1320
Merge pull request #5740 from Vyshnavi-MSFT/patch-1
seguler May 1, 2026
dc21eff
Update CHANGELOG with release notes for 2026-04-30
alvinli222 May 1, 2026
79b5ba5
Update release date in CHANGELOG.md
alvinli222 May 1, 2026
6e4dd03
added vhd files and updated comments. Removed TODOs
alvinli222 May 1, 2026
1ffb20c
addressing comments
alvinli222 May 1, 2026
97d7656
Potential fix for pull request finding
alvinli222 May 1, 2026
ae7d763
Potential fix for pull request finding
alvinli222 May 1, 2026
8f003db
Potential fix for pull request finding
alvinli222 May 1, 2026
a781f25
Update CHANGELOG for new features and GA announcements
dyu1208 May 2, 2026
44285f6
Update CHANGELOG with announcements and Kubernetes versions
dyu1208 May 2, 2026
c0591d1
Update CHANGELOG with April 2026 release notes
dyu1208 May 2, 2026
666e283
Increase memory limits for AKS monitoring components
dyu1208 May 2, 2026
1f49b61
Update CHANGELOG.md
dyu1208 May 2, 2026
fccaa21
Potential fix for pull request finding
dyu1208 May 2, 2026
e6995be
Potential fix for pull request finding
dyu1208 May 2, 2026
31c6d59
Fix wording for Gateway API ingress feature in CHANGELOG
dyu1208 May 2, 2026
f0ee706
Potential fix for pull request finding
dyu1208 May 2, 2026
c6f0043
Potential fix for pull request finding
dyu1208 May 2, 2026
88d5149
Community Call Agenda - May 2026 (#5754)
sanketbakshi1981 May 4, 2026
1e59f27
Update CHANGELOG.md
alvinli222 May 4, 2026
787349e
addressing comments
alvinli222 May 4, 2026
ee6a201
Potential fix for pull request finding
alvinli222 May 4, 2026
12a8515
extra space fix
alvinli222 May 4, 2026
21a2091
Potential fix for pull request finding
alvinli222 May 4, 2026
b3cc528
Potential fix for pull request finding
alvinli222 May 4, 2026
5fd4de2
removed 1 line
alvinli222 May 4, 2026
9eab3c2
updated release notes to match release name
alvinli222 May 4, 2026
b0c5a32
Potential fix for pull request finding
alvinli222 May 4, 2026
9c9791c
Merge pull request #5755 from Azure/2026-05-01-release
alvinli222 May 4, 2026
bf7ec80
Add Long Term Support Premium-tier billing information to changelog
May 6, 2026
f8f61c9
Merge branch 'master' of https://github.com/kaarthis/AKS
May 6, 2026
3 changes: 3 additions & 0 deletions .github/policies/manageAssignees.yml
@@ -304,6 +304,7 @@ configuration:
- mentionUsers:
mentionees:
- therealmitchconnors
- Vyshnavi-MSFT
- JackStromberg
replyTemplate: ${mentionees} would you be able to assist?
assignMentionees: True
@@ -317,6 +318,7 @@ configuration:
- mentionUsers:
mentionees:
- JackStromberg
- Vyshnavi-MSFT
- therealmitchconnors
replyTemplate: ${mentionees} would you be able to assist?
assignMentionees: True
@@ -331,6 +333,7 @@ configuration:
mentionees:
- therealmitchconnors
- JackStromberg
- Vyshnavi-MSFT
replyTemplate: ${mentionees} would you be able to assist?
assignMentionees: True
# Upstream - Helm
142 changes: 142 additions & 0 deletions CHANGELOG.md

Large diffs are not rendered by default.

294 changes: 294 additions & 0 deletions vhd-notes/AKSWindows/2022/20348.4893.260311.txt

Large diffs are not rendered by default.

315 changes: 315 additions & 0 deletions vhd-notes/AKSWindows/2022/20348.5020.260415.txt

Large diffs are not rendered by default.

232 changes: 232 additions & 0 deletions vhd-notes/AKSWindows/2025/26100.32522.260311.txt

Large diffs are not rendered by default.

380 changes: 380 additions & 0 deletions vhd-notes/AKSWindows/2025/26100.32690.260415.txt

Large diffs are not rendered by default.

449 changes: 449 additions & 0 deletions vhd-notes/AKSWindows/23H2/25398.2207.260311.txt

Large diffs are not rendered by default.

470 changes: 470 additions & 0 deletions vhd-notes/AKSWindows/23H2/25398.2274.260415.txt

Large diffs are not rendered by default.

589 changes: 589 additions & 0 deletions vhd-notes/AzureLinuxv3/202603.12.0.txt

Large diffs are not rendered by default.

589 changes: 589 additions & 0 deletions vhd-notes/AzureLinuxv3/202603.18.0.txt

Large diffs are not rendered by default.

589 changes: 589 additions & 0 deletions vhd-notes/AzureLinuxv3/202603.18.1.txt

Large diffs are not rendered by default.

593 changes: 593 additions & 0 deletions vhd-notes/AzureLinuxv3/202603.30.0.txt

Large diffs are not rendered by default.

593 changes: 593 additions & 0 deletions vhd-notes/AzureLinuxv3/202604.13.0.txt

Large diffs are not rendered by default.

606 changes: 606 additions & 0 deletions vhd-notes/AzureLinuxv3/202604.24.0.txt

Large diffs are not rendered by default.

931 changes: 931 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202603.12.0.txt

Large diffs are not rendered by default.

931 changes: 931 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202603.18.0.txt

Large diffs are not rendered by default.

931 changes: 931 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202603.18.1.txt

Large diffs are not rendered by default.

935 changes: 935 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202603.30.0.txt

Large diffs are not rendered by default.

935 changes: 935 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202604.13.0.txt

Large diffs are not rendered by default.

955 changes: 955 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2204/202604.24.0.txt

Large diffs are not rendered by default.

974 changes: 974 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202603.12.0.txt

Large diffs are not rendered by default.

974 changes: 974 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202603.18.0.txt

Large diffs are not rendered by default.

974 changes: 974 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202603.18.1.txt

Large diffs are not rendered by default.

978 changes: 978 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202603.30.0.txt

Large diffs are not rendered by default.

978 changes: 978 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202604.13.0.txt

Large diffs are not rendered by default.

992 changes: 992 additions & 0 deletions vhd-notes/aks-ubuntu/AKSUbuntu-2404/202604.24.0.txt

Large diffs are not rendered by default.

76 changes: 76 additions & 0 deletions website/blog/2026-03-30-aks-control-plane-enhancements/index.md
@@ -0,0 +1,76 @@
---
title: "AKS Control Plane Enhancements"
description: "Learn how AKS improves control plane scalability and stability through a new set of control plane enhancements"
date: 2026-03-30
authors: ["kevin-thomas"]
tags:
- operations
- scaling
- performance
---
Azure Kubernetes Service (AKS) now includes several control plane enhancements that help large clusters scale more efficiently and operate more reliably. These enhancements include streaming LIST responses, higher control plane resource limits, the AKS managed API server guard, and etcd defragmentation optimizations.

<!-- truncate -->
![AKS control plane enhancements for scalability, stability and performance](./control-plane-enhancements-large.png)

## Introduction

A key factor in Kubernetes scalability is how clients interact with the control plane and how those interactions shape the cluster’s overall scale envelope. Unoptimized API server call patterns by clients and growing etcd footprints can place increasing pressure on the control plane, ultimately limiting cluster scalability. To address this, AKS includes a set of built‑in control plane enhancements for large clusters that automatically improve scalability, performance, and stability without requiring any manual configuration from customers.

## Streaming encoder for LIST responses

The API server's response encoders traditionally serialize the entire response into a contiguous block of memory and perform one [ResponseWriter.Write](https://pkg.go.dev/net/http#ResponseWriter.Write) call to transmit data to the client. If multiple large LIST requests arrive simultaneously, the cumulative memory consumption can grow quickly, leading to out-of-memory (OOM) events that compromise cluster stability.

Kubernetes v1.33 introduced [streaming encoding for LIST responses](https://kubernetes.io/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/). This approach processes and transmits each item individually, so memory is freed progressively as each chunk is sent. In benchmarks, this reduced memory usage by up to 20x in heavy LIST scenarios.

This capability is backported to AKS versions 1.31.9+ and 1.32.6+, so your clusters benefit before upgrading to 1.33.
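
To make the difference concrete, here is a minimal Python sketch that contrasts the two strategies. It is an illustration only, not the API server's implementation (which is written in Go): the function names `encode_buffered`, `encode_streaming`, and `write_streaming` are hypothetical, and `json` stands in for the real encoders. The buffered path builds the whole response before writing once, while the streaming path writes one item at a time so that only a single item's bytes need to be held at any moment.

```python
import io
import json
from typing import Iterable, Iterator


def encode_buffered(items: list[dict]) -> bytes:
    """Serialize the whole list into one contiguous buffer (written once)."""
    return json.dumps({"kind": "List", "items": items}).encode()


def encode_streaming(items: Iterable[dict]) -> Iterator[bytes]:
    """Yield the same JSON document in chunks, one list item per chunk."""
    yield b'{"kind": "List", "items": ['
    for index, item in enumerate(items):
        prefix = b", " if index else b""
        yield prefix + json.dumps(item).encode()
    yield b"]}"


def write_streaming(items: Iterable[dict], writer) -> int:
    """Write each chunk as it is produced; return the largest chunk size."""
    largest = 0
    for chunk in encode_streaming(items):
        writer.write(chunk)
        largest = max(largest, len(chunk))
    return largest


if __name__ == "__main__":
    # Hypothetical payload: 1,000 small objects.
    pods = [{"name": f"pod-{i}", "data": "x" * 100} for i in range(1000)]
    out = io.BytesIO()
    largest_chunk = write_streaming(pods, out)
    print("buffered size:", len(encode_buffered(pods)))
    print("largest streamed chunk:", largest_chunk)
```

Both paths produce byte-identical output; what changes is the peak amount of encoded data held in memory at once, which is the property the streaming encoder targets.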

### Benefits

- **Reduced memory consumption**: Your API server uses significantly less memory when handling large list requests. This reduces the likelihood of OOM events, thereby improving API server response time.
- **Increased scalability and stability**: Your API server can handle more concurrent requests and larger datasets, raising your cluster's scale ceiling.

## Higher control plane resource limits

AKS autoscales your control plane based on two signals: cluster size, measured by the total compute cores in the cluster, and the control plane's own CPU and memory utilization. With this enhancement, your AKS control plane can now receive up to 4x higher CPU and memory limits during scaling. This gives large clusters more room to handle the most demanding workloads.

### Benefits

- **Greater scalability**: Your cluster can support more nodes and workloads. This is especially beneficial for advanced scenarios such as AI inference and training.
- **Lower latency**: Higher CPU and memory help reduce your API server's response time.
- **Higher stability**: Your control plane encounters fewer bottlenecks and remains more stable under heavy load.

## AKS managed API server guard

When the API server remains unstable after scaling to the maximum control plane resource limits, and out-of-memory (OOM) incidents continue, AKS applies a [managed flow schema and priority level configuration](https://learn.microsoft.com/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd?tabs=resource-specific#cause-4-aks-managed-api-server-guard-was-applied) that throttles non-system API server requests.

In most cases, resource-intensive LIST operations from unoptimized clients trigger this instability. This last-resort safeguard keeps the API server stable and operational, even under extreme load.
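
One common way to make a LIST-heavy client less resource-intensive is to page through results with the Kubernetes API's `limit` and `continue` parameters instead of requesting everything in one call. The sketch below shows only the shape of that loop, under stated assumptions: `fetch_page` is a hypothetical callable standing in for a real client call, and `make_fake_server` is an in-memory stand-in used so the example runs on its own.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page fetcher: (limit, continue_token) -> (items, next_token).
PageFetcher = Callable[[int, Optional[str]], tuple[list[dict], Optional[str]]]


def list_paginated(fetch_page: PageFetcher, limit: int = 500) -> Iterator[dict]:
    """Yield every object, requesting at most `limit` objects per call."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(limit, token)
        yield from items
        if not token:  # no continue token means this was the last page
            return


def make_fake_server(total: int) -> PageFetcher:
    """In-memory stand-in for an API server, for illustration only."""
    objects = [{"name": f"pod-{i}"} for i in range(total)]

    def fetch_page(limit: int, token: Optional[str]):
        start = int(token) if token else 0
        end = min(start + limit, total)
        next_token = str(end) if end < total else None
        return objects[start:end], next_token

    return fetch_page


if __name__ == "__main__":
    fetch = make_fake_server(total=1234)
    count = sum(1 for _ in list_paginated(fetch, limit=500))
    print("objects listed:", count)
```

With a real client, the continue token is opaque and comes from the previous response; the integer offset used here is purely an artifact of the fake server.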

### Benefits

- **Protects API server integrity**: Prevents your API server from becoming unresponsive due to excessive load, helping preserve overall cluster stability.
- **Simplified troubleshooting**: AKS proactively notifies you through a [resource health notification](https://learn.microsoft.com/azure/service-health/resource-health-overview) when API server guard is applied. The [API server resource intensive listing detector](https://learn.microsoft.com/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd?tabs=resource-specific#step-2---identify-and-analyse-latency-for-user-agent) in Diagnose & Solve helps you identify unoptimized clients. Once client call patterns are optimized, you can also override or modify the managed API server guard.

![Resource Health alert notification indicating that AKS managed API server guard was applied to your cluster](./rh-apf-image.png)

## etcd defragmentation optimizations

Defragmentation reclaims unused space in etcd by rewriting fragmented data into contiguous storage. Because defragmentation blocks reads and writes on the replica being defragmented, it runs on one etcd replica at a time, and larger databases take longer to defragment.

AKS now includes etcd defragmentation optimizations for large clusters, reducing defragmentation time by up to 50%. For example, in a sample cluster with an etcd size of about 2 GB, per-replica defragmentation time decreased from about 18 seconds to about 9 seconds.
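
As a rough worked example of what the per-replica numbers imply, the sketch below computes the reduction and the total time for a rolling pass. The replica count of 3 is an assumption made for illustration (the post does not state it), and the calculation ignores any pause between replicas.

```python
def rolling_defrag_seconds(per_replica_seconds: float, replicas: int) -> float:
    """Total time when replicas are defragmented one at a time, back to back."""
    return per_replica_seconds * replicas


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


REPLICAS = 3  # assumption: replica count is not stated in the post

total_before = rolling_defrag_seconds(18, REPLICAS)  # about 18 s per replica
total_after = rolling_defrag_seconds(9, REPLICAS)    # about 9 s per replica

if __name__ == "__main__":
    print("per-replica reduction (%):", percent_reduction(18, 9))
    print("rolling pass before (s):", total_before)
    print("rolling pass after (s):", total_after)
```

Because replicas are processed sequentially, any per-replica saving applies once per replica across the full pass.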

### Benefits

- **Fewer latency spikes**: Shorter defragmentation windows reduce API server response-time spikes and transient client timeouts for the reads and writes that etcd serves.

## Conclusion

These improvements make your control plane more resilient, scalable, and performant, and reduce the manual configuration needed to scale your existing clusters to handle the most demanding workloads. Keep in mind that the Kubernetes scale envelope remains multidimensional: the number and size of cluster objects (such as pods, nodes, CRDs, Secrets, and ConfigMaps), along with client behavior, continue to play a critical role in how efficiently your cluster scales.

We’re always working to improve the Kubernetes control plane upstream and the AKS control plane downstream to make the platform more reliable and easier to operate. If you have feedback or ideas, we’d love to hear from you.

To learn more about the Kubernetes scale envelope, its interaction with the control plane, client optimization, creating resource health alerts, and best practices for running large clusters, refer to:

- **[AKS Best Practices for Large Clusters](https://learn.microsoft.com/azure/aks/best-practices-performance-scale-large)**
- **[API Server and etcd - Troubleshooting Guide](https://learn.microsoft.com/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd)**
- **[Create Resource Health Alerts](https://learn.microsoft.com/azure/service-health/resource-health-alert-arm-template-guide)**
@@ -0,0 +1,11 @@
---
config:
  theme: base
  themeVariables:
    primaryColor: "#9f62eb"
---
xychart-beta
    title "NCCL all_reduce_perf — Avg Bus Bandwidth (GB/s)"
    x-axis ["1nic-unaligned (cross-NUMA)", "1nic-aligned (same NUMA)", "2nic-aligned (same NUMA)"]
    y-axis "Avg busbw (GB/s)" 0 --> 120
    bar [25, 56, 112]
@@ -0,0 +1,36 @@
flowchart TB
subgraph User["Workload Author"]
RCT["ResourceClaimTemplate<br/>(CEL selectors)"]
PodSpec["Pod Spec<br/>(resourceClaims reference)"]
end

subgraph CP["Kubernetes Control Plane"]
API["API Server<br/>(DRA API group)"]
Sched["Scheduler<br/>(Topology-aware)"]
RS_GPU["ResourceSlice<br/>(gpu.nvidia.com)<br/>pciBusID, NUMA, pcieRoot"]
RS_NIC["ResourceSlice<br/>(dra.net)<br/>rdmaDevice, NUMA, pciAddress"]
end

subgraph Node["Kubernetes Node (Azure ND GB300-v6)"]
NVDRV["NVIDIA GPU DRA Driver<br/>(DaemonSet)"]
DRANETDRV["DRANET DRA Driver<br/>(DaemonSet)"]
end

%% User submits workload
PodSpec -->|"Submit pod with<br/>resource claims"| API
RCT -->|"Define GPU+NIC<br/>alignment constraints"| API

%% Drivers publish device topology
NVDRV -->|"Discover GPUs &<br/>publish topology"| RS_GPU
DRANETDRV -->|"Discover NICs &<br/>publish topology"| RS_NIC

%% Scheduler uses slices to allocate
RS_GPU --> Sched
RS_NIC --> Sched
Sched -->|"Evaluate CEL selectors"| API
API -->|"Bind pod to node<br/>with allocated devices"| Node

%% Styling
style User fill:#fef7e0,stroke:#fbbc04
style CP fill:#e8f0fe,stroke:#4285f4
style Node fill:#f3e8fd,stroke:#9f62eb
@@ -0,0 +1,44 @@
flowchart TB
Kubelet["Kubelet"]
CRI["containerd"]
NRI["NRI Plugin<br/>(DRANET)"]

subgraph NUMA0["NUMA Node 0"]
GPU0["GPU 0<br/>NVIDIA GB300"]
GPU1["GPU 1<br/>NVIDIA GB300"]
NIC0["NIC 0<br/>NVIDIA ConnectX-8"]
NIC1["NIC 1<br/>NVIDIA ConnectX-8"]
end

subgraph NUMA1["NUMA Node 1"]
GPU2["GPU 2<br/>NVIDIA GB300"]
GPU3["GPU 3<br/>NVIDIA GB300"]
NIC2["NIC 2<br/>NVIDIA ConnectX-8"]
NIC3["NIC 3<br/>NVIDIA ConnectX-8"]
end

subgraph Pod["Scheduled Pod"]
Container["Container<br/>/dev/infiniband/uverbs*"]
end

%% Runtime flow
Kubelet -->|"1. Receive device allocation<br/>result from API Server"| CRI
CRI -->|"2. Execute OCI CreateContainer<br/>hook"| NRI
NRI -->|"3. Inject allocated<br/>/dev/infiniband/* devices"| Pod

%% NUMA-aligned GDR paths
GPU0 <-.->|"PCIe · GDR ✓"| NIC0
GPU1 <-.->|"PCIe · GDR ✓"| NIC1
GPU2 <-.->|"PCIe · GDR ✓"| NIC2
GPU3 <-.->|"PCIe · GDR ✓"| NIC3

%% Cross-NUMA penalty
GPU0 <-.->|"QPI/UPI · No GDR ✗"| NIC3

%% Pod uses aligned devices
Container -.->|"4. NCCL uses<br/>GPU * + mlx5_*"| GPU0

%% Styling
style NUMA0 fill:#e6f4ea,stroke:#34a853
style NUMA1 fill:#fce8e6,stroke:#ea4335
style Pod fill:#fef7e0,stroke:#fbbc04