
e2e: increase waitForRoutesTimeout from 90s to 120s #2799

Open
martinsander00 wants to merge 1 commit into main from ms/qa-timeout-investigation

martinsander00 commented Feb 3, 2026

Important

The primary goal of this PR is to increase waitForRoutesTimeout from 90s to 120s. The diagnostic test file (qa_bgp_propagation_test.go) is included for reference but will be removed before merging.

Summary

  • Increase waitForRoutesTimeout from 90s to 120s to reduce flaky QA test failures
  • Add diagnostic test (TestQA_BGPPropagationVariance) used to investigate the root cause

Investigation

QA tests (TestQA_UnicastConnectivity, multicast tests) were intermittently timing out at the "waiting for routes to be installed" step, particularly for the fra↔sgp (Frankfurt↔Singapore) route.

What we tested

  1. Manual timing of BGP route propagation between fra-tn-qa01 and sgp-tn-qa01
  2. Device comparison: fra-dz001 (jump_ contributor) vs fra-dz-001-x (rox contributor)
  3. Variance test: 5 iterations measuring propagation time with detailed timing breakdown

Findings

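| Iter | Disconnect | Fra Connect (BGP up) | Sgp Connect (BGP up) | Route Propagation | Total |
|------|------------|----------------------|----------------------|-------------------|-------|
| 1 | 5.5s | 38.0s | 35.7s | 0.1s | 38.5s |
| 2 | 7.7s | 9.9s | 29.8s | 48.2s | 79.9s |
| 3 | 7.7s | 9.9s | 31.3s | 45.8s | 78.8s |
| 4 | 12.1s | 6.1s | 32.3s | 35.3s | 69.6s |
| 5 | 11.2s | 8.2s | 33.3s | 33.3s | 68.4s |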

Total ≈ Sgp Connect + Route Propagation (timing starts when sgp begins connecting). The recorded totals run 1.7-2.7s above the sum of those two columns.

Additional context - Fra initial routes at start of each iteration:

| Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 |
|--------|--------|--------|--------|--------|
| 69 routes | 0 routes | 70 routes | 57 routes | 70 routes |

Key observations:

  • Cold vs warm BGP state: When fra has existing routes (warm), new routes propagate instantly (~100ms). When fra's BGP table is empty (cold start), propagation takes 33-48s
  • Iteration 2 anomaly: Fra reported "BGP Session Up" but had 0 routes, indicating a gap between session establishment and route exchange
  • Total propagation time: sgp connect (~30s) + route wait (33-48s) = 65-80s in worst case
  • 90s timeout is borderline: The worst measured total (79.9s) leaves only about 10s of margin, so run-to-run variance can push cold-start scenarios past 90s
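
To make the margin concrete, here is a small Go sketch that takes the Sgp Connect, Route Propagation, and Total columns from the findings table and reports how much headroom each timeout leaves over the worst observed total. The `sample` type and the `worstTotal` and `headroom` helpers are illustrative names, not code from this PR.

```go
package main

import "fmt"

// sample holds one iteration from the findings table, in seconds.
// The type is hypothetical and exists only for this illustration.
type sample struct {
	sgpConnect, propagation, total float64
}

// measured copies the Sgp Connect, Route Propagation, and Total columns.
var measured = []sample{
	{35.7, 0.1, 38.5},
	{29.8, 48.2, 79.9},
	{31.3, 45.8, 78.8},
	{32.3, 35.3, 69.6},
	{33.3, 33.3, 68.4},
}

// worstTotal returns the largest Total observed across all samples.
func worstTotal(s []sample) float64 {
	worst := 0.0
	for _, x := range s {
		if x.total > worst {
			worst = x.total
		}
	}
	return worst
}

// headroom returns the seconds left under timeout at the worst observed total.
func headroom(timeout float64, s []sample) float64 {
	return timeout - worstTotal(s)
}

func main() {
	for _, timeout := range []float64{90, 120} {
		fmt.Printf("timeout %.0fs: headroom %.1fs\n", timeout, headroom(timeout, measured))
	}
}
```

With only five samples, this says nothing about the tail of the distribution; it only shows the margin over the slowest run seen so far.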

Root cause

There's a delay between "BGP Session Up" status and actual route exchange completing. In cold-start scenarios (first test of the day, or after BGP state reset), route propagation between distant exchanges (xfra↔xsin) can take 65-80s total, which approaches or exceeds the 90s timeout.
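
The gap described here suggests that a readiness check should not rely on session status alone. A minimal sketch follows, assuming a hypothetical `peerState` snapshot with a session flag and a route count; neither the type nor `routesReady` comes from the e2e code in this repository.

```go
package main

import "fmt"

// peerState is a hypothetical snapshot of what a client reports.
type peerState struct {
	SessionUp bool // BGP session reported as up
	Routes    int  // number of routes currently installed
}

// routesReady treats a peer as ready only when the session is up
// and at least want routes are installed.
func routesReady(p peerState, want int) bool {
	return p.SessionUp && p.Routes >= want
}

func main() {
	// Iteration 2 in the findings: session up, but 0 routes.
	fmt.Println(routesReady(peerState{SessionUp: true, Routes: 0}, 1)) // false
	// A warm peer with routes already installed.
	fmt.Println(routesReady(peerState{SessionUp: true, Routes: 70}, 1)) // true
}
```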

Solution

Increase waitForRoutesTimeout from 90s to 120s to provide sufficient headroom for worst-case BGP propagation times.

Testing Verification

  • Ran TestQA_BGPPropagationVariance for 5 iterations; all passed with the new 120s timeout
  • Ran TestQA_UnicastConnectivity with all 4 hosts - passed in 113s
  • go build -tags=qa ./e2e/... passes

martinsander00 force-pushed the ms/qa-timeout-investigation branch from c9d0d0d to 99d53f3 on February 3, 2026 02:33
Add diagnostic test for BGP propagation timing investigation.