
SAI API Performance Monitoring #2279

Open
JaiOCP wants to merge 3 commits into opencomputeproject:master from JaiOCP:perfmon

Conversation

@JaiOCP
Contributor

@JaiOCP JaiOCP commented Apr 21, 2026

This PR brings in support for measuring SAI API performance. It is based on the presentation given at OCP 2023.

Signed-off-by: JaiOCP <jai.kumar@broadcom.com>
Contributor

@rck-innovium rck-innovium left a comment


While most of the measurements can be done at the application level, this proposal provides a way to measure the metrics per object operation inside bulk APIs, which cannot be done by application-level performance monitoring.

Comment thread doc/perfmon/SAI-perfmon-Spec.md
Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated
@rck-innovium
Contributor

As discussed, the community concluded that we should not preserve this perfmon data across warmboot (especially since we thought it does not make sense for warm upgrades/downgrades).

Comment thread inc/saiswitch.h
Comment thread doc/perfmon/SAI-perfmon-Spec.md Outdated
@tjchadaga tjchadaga added the reviewed PR is discussed in SAI Meeting label Apr 28, 2026
@deepak-singhal0408

@JaiOCP, could you address the comments? thanks,

@JaiOCP
Contributor Author

JaiOCP commented May 8, 2026

Review comments addressed. Please take a look, @j-bos @rck-innovium

@JaiOCP
Contributor Author

JaiOCP commented May 8, 2026

@JaiOCP, could you address the comments? thanks,

Comments addressed. Please review.

@deepak-singhal0408

Question: Per-object latency with aggregated reads across variable batch sizes

The spec describes PERFDATA as clear-on-read, with AVG_LATENCY computed across multiple API invocations between reads. In a typical route convergence scenario, orchagent may issue several bulk_create calls with varying batch sizes (e.g., 50, 3000, 200) before reading PERFDATA.

The returned average latency is per-call — but each call processes a different number of objects. Without knowing the total object count across those calls, the consumer cannot derive per-object latency:

Call 1: bulk_create(50 routes)   → 50µs
Call 2: bulk_create(3000 routes) → 900µs  
Call 3: bulk_create(200 routes)  → 100µs
Read AVG_LATENCY → (50+900+100)/3 = 350µs per-call avg

But per-route avg = (50+900+100)/(50+3000+200) = 0.32µs
    ← cannot be derived without total object count

The previous revision (#2265) addressed this with num_objects in sai_perfdata_t. Would it make sense to add an object count attribute (e.g., SAI_PERFMON_ATTR_OBJECT_COUNT, also READ_ONLY + clear-on-read) so consumers can compute per-object metrics from aggregated reads?

Comment thread inc/saiperfmon.h
Comment on lines +178 to +180
/**
* @brief SAI Performance Monitoring API set
*/
Collaborator


this comment is not needed

@kcudnik
Collaborator

kcudnik commented May 14, 2026

Question: Per-object latency with aggregated reads across variable batch sizes


I would assume that bulk route create is one of the heaviest APIs to call, and I would guess that other bulk APIs may be faster. Reading the performance data should also be fast; internally it should just be reading/copying a table.

@JaiOCP
Contributor Author

JaiOCP commented May 14, 2026

Question: Per-object latency with aggregated reads across variable batch sizes


Hi Deepak,

As we discussed, computing it this way would be a wrong implementation in the SAI adapter.
The SAI adapter MUST do as follows:
Call 1: bulk_create(50 routes) → 50µs
Call 2: bulk_create(3000 routes) → 900µs
Call 3: bulk_create(200 routes) → 100µs
Read AVG_LATENCY → (50+900+100)/(50+3000+200) = 0.32µs
Essentially, the SAI adapter needs to maintain the object count for computing the average until the next clear-on-read.

@JaiOCP
Contributor Author

JaiOCP commented May 15, 2026

@rck-innovium @j-bos Please approve the PR

Labels

reviewed PR is discussed in SAI Meeting
