OCPBUGS-81521: Adapt dashboard Prometheus polling interval based on query response time #16441
stefanonardo wants to merge 1 commit into
Conversation
@stefanonardo: This pull request references Jira Issue OCPBUGS-81521, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@stefanonardo: This pull request references Jira Issue OCPBUGS-81521, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@stefanonardo: This pull request references Jira Issue OCPBUGS-81521, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@stefanonardo: This pull request references Jira Issue OCPBUGS-81521, which is valid. 3 validation(s) were run on this bug.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
📝 Walkthrough
This pull request refactors the Playwright E2E test infrastructure from global setup/teardown hooks into explicit modular projects (cluster-setup, admin-auth, developer-auth, teardown). It extracts shared login logic into reusable helpers, adds a developer perspective smoke test, and updates the Playwright configuration to reflect the new dependency graph. Additionally, dashboard data fetching is enhanced with adaptive polling that adjusts delays based on response times via an exponential moving average, replacing fixed-delay retry behavior.
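For readers unfamiliar with Playwright project dependencies, here is a minimal sketch of how a setup/auth/teardown graph like the one described above is typically wired in playwright.config.ts. The project names follow the walkthrough; the testMatch globs, the smoke project, and the teardown wiring are illustrative assumptions, not the PR's actual configuration.

```ts
// Hypothetical sketch of the modular project graph; globs and names other
// than the four projects listed in the walkthrough are assumptions.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // Provisions cluster-level test resources once; owns the teardown project.
    { name: 'cluster-setup', testMatch: /cluster\.setup\.ts/, teardown: 'teardown' },
    // Auth projects run after cluster setup and persist login storage state.
    { name: 'admin-auth', testMatch: /admin-auth\.setup\.ts/, dependencies: ['cluster-setup'] },
    { name: 'developer-auth', testMatch: /developer-auth\.setup\.ts/, dependencies: ['cluster-setup'] },
    // Spec files depend on the auth projects' saved sessions.
    {
      name: 'smoke',
      testMatch: /smoke\/.*\.spec\.ts/,
      dependencies: ['admin-auth', 'developer-auth'],
    },
    // Deletes test resources after all dependent projects finish.
    { name: 'teardown', testMatch: /teardown\.setup\.ts/ },
  ],
});
```

Playwright runs a project's dependencies before it and runs a declared teardown project only after everything that depends on the corresponding setup has finished.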
🚥 Pre-merge checks | ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
frontend/playwright.config.ts (1)
131-141: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Align developer project creation with the same credential checks used by setup.
Developer projects are enabled from a username-only flag, while developer-auth.setup.ts skips unless both username and password exist. This mismatch can produce developer projects without valid auth state.
Suggested patch
```diff
-const hasDeveloper = !!process.env.BRIDGE_HTPASSWD_USERNAME;
+const hasDeveloper =
+  !!process.env.BRIDGE_HTPASSWD_USERNAME && !!process.env.BRIDGE_HTPASSWD_PASSWORD;
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/playwright.config.ts` around lines 131 - 141, The developer project generation uses hasDeveloper (username-only) which can create projects without valid auth; change the conditional that yields the developer entries to require the same credentials check used in developer-auth.setup.ts (i.e., ensure both developer username and developer password exist) or reuse the shared helper/flag from developer-auth.setup.ts instead of hasDeveloper; update the condition surrounding devPackages mapping (and any references to developerStorageState) so developer projects are only created when both username and password are present.
🧹 Nitpick comments (1)
frontend/public/actions/__tests__/dashboards.spec.ts (1)
165-174: ⚡ Quick win
Strengthen the error-backoff assertion to actually prevent ceiling jumps.
This test currently allows the final delay to equal MAX_POLL_DELAY, which conflicts with its stated intent. Seed with a fast success first, then force an error, and assert that the second delay increases but stays strictly below the max.
Proposed test tightening
```diff
-  it('backs off on fetch error without jumping to MAX_POLL_DELAY', async () => {
-    const fetchMock = jest.fn().mockRejectedValueOnce(new Error('network error'));
+  it('backs off on fetch error without jumping to MAX_POLL_DELAY', async () => {
+    const fetchMock = jest
+      .fn()
+      .mockResolvedValueOnce({ data: 'test' })
+      .mockRejectedValueOnce(new Error('network error'));
     setupWatchURL(fetchMock);
     await flushPromises();

+    const firstTimeout = setTimeoutSpy.mock.calls[setTimeoutSpy.mock.calls.length - 1];
+    const firstDelay = firstTimeout[1] as number;
+
+    const nextPoll = firstTimeout[0] as (...args: unknown[]) => unknown;
+    nextPoll();
+    await flushPromises();
+
     const lastSetTimeout = setTimeoutSpy.mock.calls[setTimeoutSpy.mock.calls.length - 1];
-    expect(lastSetTimeout[1]).toBeGreaterThan(MIN_POLL_DELAY);
-    expect(lastSetTimeout[1]).toBeLessThanOrEqual(MAX_POLL_DELAY);
+    expect(lastSetTimeout[1]).toBeGreaterThan(firstDelay);
+    expect(lastSetTimeout[1]).toBeLessThan(MAX_POLL_DELAY);
   });
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/public/actions/__tests__/dashboards.spec.ts` around lines 165 - 174, The test currently allows the backoff to equal MAX_POLL_DELAY which contradicts its intent; update the test in dashboards.spec.ts to first seed a fast successful poll (call setupWatchURL with a fetch mock that resolves once quickly), then trigger a rejection (mockRejectedValueOnce) so the backoff increases from the previous delay, and assert the subsequent timeout delay (inspect setTimeoutSpy.mock.calls[...] like in the existing test) is greater than the prior delay and strictly less than MAX_POLL_DELAY (use < MAX_POLL_DELAY, not <=), while still being > MIN_POLL_DELAY; keep references to setupWatchURL, setTimeoutSpy, flushPromises, MIN_POLL_DELAY and MAX_POLL_DELAY when locating and changing the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@frontend/e2e/setup/teardown.setup.ts`:
- Around line 24-50: The teardown leaves CONFIG_FILE (.test-config.json) on
disk; wrap the config read and namespace deletion logic in a try/finally so the
finally always runs and removes CONFIG_FILE regardless of early returns or
errors. Specifically, keep reading into testNamespace/authToken/kubeConfigPath
and running KubernetesClient.deleteNamespace and waitForNamespaceDeleted as
before (referencing CONFIG_FILE, testNamespace, authToken, kubeConfigPath, and
KubernetesClient), but move the current early returns into the try and perform
fs.unlinkSync or fs.rmSync(CONFIG_FILE) inside the finally (guarded by
fs.existsSync and swallowing/logging any unlink errors) so the file is always
removed after teardown completes.
In `@frontend/public/actions/dashboards.ts`:
- Around line 86-94: The catch path currently seeds the EMA with MAX_POLL_DELAY
/ SCALE_FACTOR when responseTimeEma is zero, which can jump retries straight to
MAX_POLL_DELAY; modify the logic around computeAdaptiveDelay so that when
responseTimeEma === 0 you seed the EMA from the "floor" value (the
minimal/steady-state equivalent) instead of MAX_POLL_DELAY / SCALE_FACTOR (e.g.
use a MIN_POLL_DELAY-based seed or the floor EMA value), keeping all calls and
variables (computeAdaptiveDelay, responseTimeEma, MAX_POLL_DELAY, SCALE_FACTOR,
emaToDelay, fetchPeriodically) intact; ensure nextEma is computed from that
floor-seeded value so the first failure backs off conservatively rather than
immediately scheduling the max delay.
---
Outside diff comments:
In `@frontend/playwright.config.ts`:
- Around line 131-141: The developer project generation uses hasDeveloper
(username-only) which can create projects without valid auth; change the
conditional that yields the developer entries to require the same credentials
check used in developer-auth.setup.ts (i.e., ensure both developer username and
developer password exist) or reuse the shared helper/flag from
developer-auth.setup.ts instead of hasDeveloper; update the condition
surrounding devPackages mapping (and any references to developerStorageState) so
developer projects are only created when both username and password are present.
---
Nitpick comments:
In `@frontend/public/actions/__tests__/dashboards.spec.ts`:
- Around line 165-174: The test currently allows the backoff to equal
MAX_POLL_DELAY which contradicts its intent; update the test in
dashboards.spec.ts to first seed a fast successful poll (call setupWatchURL with
a fetch mock that resolves once quickly), then trigger a rejection
(mockRejectedValueOnce) so the backoff increases from the previous delay, and
assert the subsequent timeout delay (inspect setTimeoutSpy.mock.calls[...] like
in the existing test) is greater than the prior delay and strictly less than
MAX_POLL_DELAY (use < MAX_POLL_DELAY, not <=), while still being >
MIN_POLL_DELAY; keep references to setupWatchURL, setTimeoutSpy, flushPromises,
MIN_POLL_DELAY and MAX_POLL_DELAY when locating and changing the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1e3997f8-4837-4ba4-b681-3b92dd163d9c
📒 Files selected for processing (14)
- frontend/e2e/global.setup.ts
- frontend/e2e/global.teardown.ts
- frontend/e2e/setup/admin-auth.setup.ts
- frontend/e2e/setup/cluster.setup.ts
- frontend/e2e/setup/developer-auth.setup.ts
- frontend/e2e/setup/login-helper.ts
- frontend/e2e/setup/teardown.setup.ts
- frontend/e2e/tests/smoke/developer/smoke-test.spec.ts
- frontend/package.json
- frontend/playwright.config.ts
- frontend/public/actions/__tests__/dashboards.spec.ts
- frontend/public/actions/dashboards.ts
- frontend/public/components/utils/__tests__/adaptive-polling.spec.ts
- frontend/public/components/utils/adaptive-polling.ts
💤 Files with no reviewable changes (2)
- frontend/e2e/global.setup.ts
- frontend/e2e/global.teardown.ts
📜 Review details
🔇 Additional comments (9)
frontend/public/components/utils/__tests__/adaptive-polling.spec.ts (1)
1-98: LGTM!
frontend/public/components/utils/adaptive-polling.ts (1)
1-34: LGTM!
frontend/e2e/setup/cluster.setup.ts (1)
13-65: LGTM!
frontend/e2e/setup/login-helper.ts (1)
9-50: LGTM!
frontend/e2e/setup/admin-auth.setup.ts (1)
9-18: LGTM!
frontend/e2e/setup/developer-auth.setup.ts (1)
9-22: LGTM!
frontend/playwright.config.ts (1)
30-126: LGTM!
frontend/e2e/tests/smoke/developer/smoke-test.spec.ts (1)
3-8: LGTM!
frontend/package.json (1)
57-57: LGTM!
```ts
  try {
    const config = JSON.parse(fs.readFileSync(CONFIG_FILE, 'utf-8'));
    testNamespace = config.testNamespace;
    kubeConfigPath = config.kubeConfigPath;
    authToken = config.authToken;
  } catch {
    return;
  }

  if (!testNamespace) {
    return;
  }

  const client = new KubernetesClient(
    {
      clusterUrl: process.env.CLUSTER_URL || '',
      username: process.env.OPENSHIFT_USERNAME || 'kubeadmin',
      password: process.env.BRIDGE_KUBEADMIN_PASSWORD || '',
      token: authToken,
    },
    kubeConfigPath,
  );

  await client.deleteNamespace(testNamespace);
  const deleted = await client.waitForNamespaceDeleted(testNamespace, 120_000);
  expect(deleted, `Namespace ${testNamespace} should be deleted within 120s`).toBe(true);
});
```
Delete .test-config.json after teardown completes.
This file contains a bearer token and currently persists after cleanup. Remove it in a finally block to reduce secret-retention risk.
Suggested patch
```diff
teardown('delete test namespace', async () => {
@@
- let testNamespace: string | undefined;
- let kubeConfigPath: string | undefined;
- let authToken: string | undefined;
-
- try {
- const config = JSON.parse(fs.readFileSync(CONFIG_FILE, 'utf-8'));
- testNamespace = config.testNamespace;
- kubeConfigPath = config.kubeConfigPath;
- authToken = config.authToken;
- } catch {
- return;
- }
-
- if (!testNamespace) {
- return;
- }
-
- const client = new KubernetesClient(
- {
- clusterUrl: process.env.CLUSTER_URL || '',
- username: process.env.OPENSHIFT_USERNAME || 'kubeadmin',
- password: process.env.BRIDGE_KUBEADMIN_PASSWORD || '',
- token: authToken,
- },
- kubeConfigPath,
- );
-
- await client.deleteNamespace(testNamespace);
- const deleted = await client.waitForNamespaceDeleted(testNamespace, 120_000);
- expect(deleted, `Namespace ${testNamespace} should be deleted within 120s`).toBe(true);
+ try {
+ let testNamespace: string | undefined;
+ let kubeConfigPath: string | undefined;
+ let authToken: string | undefined;
+
+ try {
+ const config = JSON.parse(fs.readFileSync(CONFIG_FILE, 'utf-8'));
+ testNamespace = config.testNamespace;
+ kubeConfigPath = config.kubeConfigPath;
+ authToken = config.authToken;
+ } catch {
+ return;
+ }
+
+ if (!testNamespace) {
+ return;
+ }
+
+ const client = new KubernetesClient(
+ {
+ clusterUrl: process.env.CLUSTER_URL || '',
+ username: process.env.OPENSHIFT_USERNAME || 'kubeadmin',
+ password: process.env.BRIDGE_KUBEADMIN_PASSWORD || '',
+ token: authToken,
+ },
+ kubeConfigPath,
+ );
+
+ await client.deleteNamespace(testNamespace);
+ const deleted = await client.waitForNamespaceDeleted(testNamespace, 120_000);
+ expect(deleted, `Namespace ${testNamespace} should be deleted within 120s`).toBe(true);
+ } finally {
+ fs.rmSync(CONFIG_FILE, { force: true });
+ }
});
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/e2e/setup/teardown.setup.ts` around lines 24 - 50, The teardown
leaves CONFIG_FILE (.test-config.json) on disk; wrap the config read and
namespace deletion logic in a try/finally so the finally always runs and removes
CONFIG_FILE regardless of early returns or errors. Specifically, keep reading
into testNamespace/authToken/kubeConfigPath and running
KubernetesClient.deleteNamespace and waitForNamespaceDeleted as before
(referencing CONFIG_FILE, testNamespace, authToken, kubeConfigPath, and
KubernetesClient), but move the current early returns into the try and perform
fs.unlinkSync or fs.rmSync(CONFIG_FILE) inside the finally (guarded by
fs.existsSync and swallowing/logging any unlink errors) so the file is always
removed after teardown completes.
```diff
       // Feed a synthetic slow response into the EMA to gradually back off without jumping to max
       [, nextEma] = computeAdaptiveDelay(MAX_POLL_DELAY / SCALE_FACTOR, responseTimeEma);
       dispatch(setError(type, key, error));
       dispatch(setData(type, key, null));
     } finally {
       dispatch(updateWatchInFlight(type, key, false));
       const timeout = setTimeout(
-        () => fetchPeriodically(dispatch, type, key, getURL, getState, fetch),
-        URL_POLL_DEFAULT_DELAY,
+        () => fetchPeriodically(dispatch, type, key, getURL, getState, fetch, nextEma),
+        emaToDelay(nextEma),
```
First error can jump straight to 60s retry, causing overly aggressive backoff.
With responseTimeEma = 0, the catch path seeds EMA with MAX_POLL_DELAY / SCALE_FACTOR, which immediately schedules MAX_POLL_DELAY. That delays recovery after transient first-request failures.
Suggested fix (seed from floor-equivalent EMA when no history)
```diff
 import {
   computeAdaptiveDelay,
   emaToDelay,
+  MIN_POLL_DELAY,
   MAX_POLL_DELAY,
   SCALE_FACTOR,
 } from '../components/utils/adaptive-polling';
@@
   } catch (error) {
-    // Feed a synthetic slow response into the EMA to gradually back off without jumping to max
-    [, nextEma] = computeAdaptiveDelay(MAX_POLL_DELAY / SCALE_FACTOR, responseTimeEma);
+    // Feed a synthetic slow response into EMA; if no history, seed from floor-equivalent EMA.
+    const emaSeed =
+      responseTimeEma > 0 ? responseTimeEma : MIN_POLL_DELAY / SCALE_FACTOR;
+    [, nextEma] = computeAdaptiveDelay(MAX_POLL_DELAY / SCALE_FACTOR, emaSeed);
     dispatch(setError(type, key, error));
     dispatch(setData(type, key, null));
   } finally {
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-      // Feed a synthetic slow response into the EMA to gradually back off without jumping to max
-      [, nextEma] = computeAdaptiveDelay(MAX_POLL_DELAY / SCALE_FACTOR, responseTimeEma);
+      // Feed a synthetic slow response into EMA; if no history, seed from floor-equivalent EMA.
+      const emaSeed =
+        responseTimeEma > 0 ? responseTimeEma : MIN_POLL_DELAY / SCALE_FACTOR;
+      [, nextEma] = computeAdaptiveDelay(MAX_POLL_DELAY / SCALE_FACTOR, emaSeed);
       dispatch(setError(type, key, error));
       dispatch(setData(type, key, null));
     } finally {
       dispatch(updateWatchInFlight(type, key, false));
       const timeout = setTimeout(
-        () => fetchPeriodically(dispatch, type, key, getURL, getState, fetch),
-        URL_POLL_DEFAULT_DELAY,
+        () => fetchPeriodically(dispatch, type, key, getURL, getState, fetch, nextEma),
+        emaToDelay(nextEma),
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/public/actions/dashboards.ts` around lines 86 - 94, The catch path
currently seeds the EMA with MAX_POLL_DELAY / SCALE_FACTOR when responseTimeEma
is zero, which can jump retries straight to MAX_POLL_DELAY; modify the logic
around computeAdaptiveDelay so that when responseTimeEma === 0 you seed the EMA
from the "floor" value (the minimal/steady-state equivalent) instead of
MAX_POLL_DELAY / SCALE_FACTOR (e.g. use a MIN_POLL_DELAY-based seed or the floor
EMA value), keeping all calls and variables (computeAdaptiveDelay,
responseTimeEma, MAX_POLL_DELAY, SCALE_FACTOR, emaToDelay, fetchPeriodically)
intact; ensure nextEma is computed from that floor-seeded value so the first
failure backs off conservatively rather than immediately scheduling the max
delay.
Force-pushed from 4017ab6 to 97c74b3 (Compare)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jhadvig, stefanonardo
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from 97c74b3 to 09d4b6c (Compare)
New changes are detected. LGTM label has been removed.
QA Verification Evidence
Verification Steps
Step 5: Monitoring dashboards (404 - plugin not loaded) (pass)
Warning: This verification was performed by an AI agent. Results may contain false positives or miss
Automated QA verification by Claude Code
/retest
…uery response time
Replace the hardcoded 15s polling interval in fetchPeriodically with an adaptive delay derived from an Exponential Moving Average of response times. Fast clusters stay at the 15s floor while slow/large clusters automatically back off up to 60s, reducing unnecessary Prometheus load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 09d4b6c to 1828b2b (Compare)
/retest
@stefanonardo: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
















Analysis / Root cause:
Dashboard Prometheus queries poll every 15 seconds regardless of cluster size. On large clusters,
these queries are expensive (more time series to aggregate), creating unnecessary load on the
Prometheus/Thanos monitoring stack.
Solution description:
Replace the hardcoded 15s polling delay in fetchPeriodically with an adaptive interval derived from an Exponential Moving Average (EMA) of query response times. Fast clusters (~500ms responses) stay at the 15s floor, while slow/large clusters automatically back off up to 60s.
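To make the mechanism concrete, here is a minimal sketch of what computeAdaptiveDelay() and emaToDelay() could look like under these constraints. Only the names and the 15s/60s bounds come from this PR; the SCALE_FACTOR value, the 0.3 smoothing factor, and the seeding rule are assumptions for illustration, not the actual implementation.

```ts
// Illustrative sketch only — constant values and smoothing factor are assumptions.
export const MIN_POLL_DELAY = 15_000; // 15s floor (from the PR description)
export const MAX_POLL_DELAY = 60_000; // 60s ceiling (from the PR description)
export const SCALE_FACTOR = 30; // assumed: delay is ~30x the smoothed response time
const ALPHA = 0.3; // assumed EMA smoothing factor

// Clamp the scaled EMA between the polling floor and ceiling.
export const emaToDelay = (ema: number): number =>
  Math.min(MAX_POLL_DELAY, Math.max(MIN_POLL_DELAY, ema * SCALE_FACTOR));

// Fold the latest response time into the EMA and derive the next poll delay.
// Returns [delay, nextEma]; non-finite samples are guarded, and an empty
// history (prevEma === 0) is seeded directly from the sample.
export const computeAdaptiveDelay = (
  responseTimeMs: number,
  prevEma: number,
): [number, number] => {
  const sample = Number.isFinite(responseTimeMs) ? Math.max(responseTimeMs, 0) : 0;
  const nextEma = prevEma > 0 ? ALPHA * sample + (1 - ALPHA) * prevEma : sample;
  return [emaToDelay(nextEma), nextEma];
};
```

Under these assumptions, a cluster answering in ~500ms stays clamped at the 15s floor, while one answering in 2s or more drifts toward the 60s ceiling. It also shows why seeding an empty EMA from a synthetic MAX_POLL_DELAY / SCALE_FACTOR sample jumps straight to the ceiling — the first-error behavior flagged in the review above.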
- adaptive-polling.ts with computeAdaptiveDelay() and emaToDelay()
- dashboards.ts fetchPeriodically to measure fetch duration and compute adaptive delay
Screenshots / screen recording:
N/A — no visual changes. Polling interval changes are observable in browser DevTools Network tab.
Test setup:
No special setup required.
Test cases:
- computeAdaptiveDelay and emaToDelay (boundary values, EMA smoothing, NaN guards)
- fetchPeriodically adaptive behavior (fast response, slow response, error backoff)
Browser conformance:
Additional info:
Jira: https://redhat.atlassian.net/browse/OCPBUGS-81521
Summary by CodeRabbit
New Features
Tests