[server][controller] Change defaults to production values#2863
Open
kvargha wants to merge 15 commits into
Open
[server][controller] Change defaults to production values#2863kvargha wants to merge 15 commits into
kvargha wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Aligns OSS sample server/docker configs and controller-cluster Helix resource setup with production/code defaults, so users starting from provided samples don’t unknowingly diverge from the standard operational behavior.
Changes:
- Removed sample overrides for server leader-promotion delay and Netty graceful shutdown period so samples inherit existing code defaults (300s and 30s respectively).
- Updated controller-cluster resource rebalance strategy for the storage-cluster resource from CRUSH to CRUSHED (
CrushEdRebalanceStrategy). - Updated the related comment in
VeniceHelixAdminto reflect the strategy intent.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-standalone/config/server.properties | Removes sample overrides so standalone inherits code/prod defaults for promotion delay and graceful shutdown. |
| docker/venice-server/single-dc-configs/server.properties | Removes docker single-DC sample overrides so defaults match production/code behavior. |
| docker/venice-server/multi-dc-configs/dc-0/server.properties | Removes docker multi-DC (dc-0) sample overrides so defaults match production/code behavior. |
| docker/venice-server/multi-dc-configs/dc-1/server.properties | Removes docker multi-DC (dc-1) sample overrides so defaults match production/code behavior. |
| services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java | Switches controller-cluster resource rebalance strategy to CRUSHED for the storage-cluster resource. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
29fd31c to
4cff363
Compare
4cff363 to
3a9cdab
Compare
3a9cdab to
ba85d80
Compare
fa27b0b to
d75fb7e
Compare
73344c8 to
48f75af
Compare
9e8c6c6 to
9ccf2cd
Compare
Number the clusterProperties locals in testInvalidRebalancePreferenceAndCapacityKeys 1-7 in declaration order after the capacity positivity cases were inserted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch the legacy (non-HaaS) controller-cluster resource rebalancer from DelayedAutoRebalancer to WagedRebalancer, matching the HaaS path. WAGED computes an all-or-nothing assignment, so it cannot place a controller-cluster resource that requests more replicas than there are controllers; DelayedAutoRebalancer tolerated that under-replication. The integration-test framework runs 1 controller by default but left controller.cluster.replica at its default of 3, so thread the controller count through VeniceControllerCreateOptions and set CONTROLLER_CLUSTER_REPLICA = min(numberOfControllers, 3), mirroring how kafka.replication.factor is set to match the single test broker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
controller cluster WAGED computes its baseline assignment asynchronously by default (GLOBAL_REBALANCE_ASYNC_MODE defaults to true), so a freshly added controller-cluster resource gets an empty assignment on the first rebalance pipeline run and only converges after the async baseline finishes and re-triggers the pipeline. The non-HaaS controller runs that pipeline in-process and blocks in waitUntilClusterResourceIsVisibleInEV during startup, so the async convergence can miss the startup window and the controller fails to start. Set GLOBAL_REBALANCE_ASYNC_MODE=false on the controller cluster so WAGED computes the baseline synchronously in the first pipeline run and the external view becomes visible promptly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cluster The non-HaaS controller-cluster creation path only set auto-join, topology-aware=false, and synchronous global rebalance, while the HaaS path additionally applies the global rebalance preference, the capacity config, and persist-best-possible. Mirror the HaaS WAGED cluster config here so both paths assign the controller-cluster resource the same way. Switch from setConfig(HelixConfigScope, Map) to updateClusterConfigs(ClusterConfig) because the structured rebalance-preference and capacity maps cannot be carried by the string-keyed config map. Keep GLOBAL_REBALANCE_ASYNC_MODE=false, which the non-HaaS in-process pipeline needs but the HaaS dedicated controller does not. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gy config The controller cluster is managed by WAGED, not CRUSH, so the comment claiming the topology/fault-zone fields are used by the CRUSH algorithm is inaccurate. WAGED also consumes these fields (AssignableNode computes the fault zone from the cluster topology config), so reword to describe fault-zone-aware placement for the configured rebalancer in both the non-HaaS (VeniceHelixAdmin) and HaaS (ZkHelixAdminClient) paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion value Change the default from 120000 (2 min) to 960000 (16 min), matching the value used in production. This is the controller-side ceiling for how long addVersion waits for a new version's replicas to be assigned before failing the push; the longer budget keeps in-flight pushes alive across transient storage-node disruption such as rolling restarts. Update the adjacent comment, which referenced a stale 3-minute client timeout, to describe the actual coupling with the controller client's per-request timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve the controller wait The version-creation (request_topic) call can block on the controller for up to offline.job.start.timeout.ms while addVersion waits for the new version's replicas to be assigned. The client request timeout was hardcoded at 10 min (DEFAULT_REQUEST_TIMEOUT_MS), so once that controller-side wait was raised to 16 min, a slow assignment made the client time out first; the inner ControllerClient retry loop then re-issued the non-idempotent request_topic, creating extra abandoned versions. Add a configurable controller.request.topic.timeout.ms (default 20 min, above the 16-min controller wait) and thread it from the VPJ to a new requestTopicForWrites overload, so the controller always responds (with success or its own error) before the client times out. Other controller calls keep the existing 10-min default. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ble, above the" This reverts commit 4fe3090.
…tart wait The request_topic (version creation) call blocks on the controller for up to offline.job.start.timeout.ms (16 min) while addVersion waits for the new version's replicas to be assigned. The client per-request timeout was 10 min, so a slow assignment made the client time out first and retry the non-idempotent create. Raise DEFAULT_REQUEST_TIMEOUT_MS to 17 min so the controller responds before the client gives up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
VeniceControllerClusterConfig now always provides a non-null rebalance preference and capacity config (production defaults), so the null-guards around setGlobalRebalancePreference and the capacity setters in both the non-HaaS (VeniceHelixAdmin) and HaaS (ZkHelixAdminClient) controller-cluster setup are dead code. Remove them. Delete the TestZkHelixAdminClient cases that fed null/partial preference/capacity through the mock (now-impossible states; the config defaults and validation are covered by TestVeniceControllerClusterConfig), and stub the defaults in setUp so the remaining real-method tests reflect the non-null contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TestHAASController The controller-cluster WAGED config setters no longer guard against a null rebalance preference or capacity config, because VeniceControllerClusterConfig always provides them from production defaults. TestHAASController's testCloudConfig and testConcurrentClusterInitialization mock VeniceControllerMultiClusterConfig, whose unstubbed getters returned null and previously relied on the removed guards. Stub both getters non-null to mirror the production contract so createVeniceControllerCluster completes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The implicit CONTROLLER_CLUSTER_REPLICA override derived the replica count from options.getNumberOfControllers() capped at 3. A cluster built with numberOfControllers(0) that later adds a controller would set the replica count to 0, producing an invalid Helix IdealState (replicas=0, minActiveReplicas=1) and WAGED rebalance failures. Clamp the lower bound to 1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
From the server/controller production-defaults audit:
VeniceHelixAdmin) does not apply the WAGED cluster config the HaaS path (ZkHelixAdminClient) sets.Solution
Server sample defaults
server.promotion.to.leader.replica.delay.seconds=1andserver.netty.graceful.shutdown.period.seconds=0from the server Docker/standalone samples so they inherit the code defaults (300s/30s).Controller WAGED defaults
EVENNESS=10,LESS_MOVEMENT=1,FORCE_BASELINE_CONVERGE=0) and capacity (instance10000, weight100) inVeniceControllerClusterConfiginstead of leaving them unset.VeniceHelixAdminswitches toWagedRebalancer;ZkHelixAdminClientdrops the explicitAutoRebalanceStrategy.ZkHelixAdminClient.GLOBAL_REBALANCE_ASYNC_MODE=falseon the controller cluster: the non-HaaS controller runs the Helix pipeline in-process and blocks on the resource appearing in the external view at startup, but WAGED computes its baseline async by default, so the first pipeline run is empty and can miss that startup window. Synchronous global rebalance computes the assignment on the first run.numberOfControllersthrough the test controller-creation path and capCONTROLLER_CLUSTER_REPLICAto the controller count, avoiding WAGEDFAILED_TO_CALCULATE.Push-start timeouts
offline.job.start.timeout.ms:120000(2 min) ->960000(16 min), the production value (controller-side wait for a new version's replicas to be assigned).ControllerClient.DEFAULT_REQUEST_TIMEOUT_MS: 10 min -> 17 min, above the 16-min controller wait, sorequest_topicdoes not expire client-side and retry the non-idempotent create.No other capacity, hardware, topology, blob-transfer, server-shutdown, or partition-drop defaults change.
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
Extended
TestVeniceControllerClusterConfigandTestZkHelixAdminClient; validated WAGED controller-cluster convergence via the existing multi-region E2E suite. Full CI passes.Does this PR introduce any user-facing or breaking changes?
offline.job.start.timeout.msgoes 2 min -> 16 min and the controller client request timeout 10 min -> 17 min.