Feature: Adaptive resolver balancing strategy (weighted scoring) #80
Description
Problem
Since the MTU is set to 40, the number of resolvers that get discovered is large. If we add a new balancing strategy that decides which resolvers to use based on their performance, I think that would work much better than sending the same amount of traffic to every resolver.
I discussed the problems and corner cases with an AI; if it looks OK, I'd like this feature to be added.
Since a resolver can become weaker or stronger from one minute to the next, we could build a scoring system that, based on each resolver's behavior over the time the binary has been running, decides which resolvers are in the best overall shape.
After running for 10+ minutes with a large resolver pool (e.g., 500 resolvers), the client has collected extensive performance data per resolver — but all valid resolvers still receive equal traffic. This creates a wasteful cycle:
- Flaky resolvers get equal traffic → timeouts occur
- Auto-disable kicks in → resolver disabled
- Recheck probe passes → resolver reactivated with full traffic
- Cycle repeats every few seconds/minutes
The result is constant resolver churn visible in logs (🛑 disabled → 🔄 re-activated), reduced effective resolver pool, and slower throughput.
Proposal: New RESOLVER_BALANCING_STRATEGY = "adaptive" mode
A weighted selection strategy that learns from real-time resolver performance and routes more traffic to better-performing resolvers.
Core design
- Sliding window scoring — Each resolver maintains a performance score based on observations within a configurable recent time window (e.g., the last 5 minutes). Old data expires automatically, so the system adapts when resolver quality changes over time (ISP throttling, route changes, resolver load fluctuations).
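A minimal sketch of one such per-resolver window (the `SlidingWindow` name and its API are illustrative, not from the codebase):

```python
import time
from collections import deque

class SlidingWindow:
    """Per-resolver observation store; anything older than the window
    (default 300 s, matching the suggested ADAPTIVE_SCORE_WINDOW_SECONDS)
    expires automatically."""

    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self.obs = deque()  # entries: (timestamp, success: bool, rtt_seconds)

    def record(self, success, rtt, now=None):
        now = time.monotonic() if now is None else now
        self.obs.append((now, success, rtt))
        self._expire(now)

    def _expire(self, now):
        # Drop observations that fell out of the sliding window.
        while self.obs and now - self.obs[0][0] > self.window:
            self.obs.popleft()

    def stats(self, now=None):
        """Return (success_rate, avg_rtt_of_successes), or None if no
        recent data remains (caller falls back to a neutral score)."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        if not self.obs:
            return None
        successes = [rtt for _, ok, rtt in self.obs if ok]
        rate = len(successes) / len(self.obs)
        avg_rtt = sum(successes) / len(successes) if successes else float("inf")
        return rate, avg_rtt
```

Because old entries expire on every read and write, a resolver that was throttled ten minutes ago is judged only on what it has done inside the current window.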
- Score formula — Combines success rate and latency:

```
successRate   = successes / totalObservations
latencyFactor = 1 / normalizedAvgRTT
score         = successRate * latencyFactor
```

Scores are relative (compared against the pool), not absolute.
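A direct translation of the formula, assuming normalizedAvgRTT means the resolver's average RTT divided by the pool-wide average (the issue doesn't pin down the normalization, so that part is an assumption):

```python
def score(success_rate, avg_rtt, pool_avg_rtt):
    """successRate * latencyFactor, where latencyFactor = 1 / normalizedAvgRTT
    and normalizedAvgRTT = avg_rtt / pool_avg_rtt (assumed normalization)."""
    if avg_rtt <= 0 or pool_avg_rtt <= 0:
        return 0.0  # no usable latency data
    latency_factor = pool_avg_rtt / avg_rtt
    return success_rate * latency_factor
```

Under this normalization a resolver at the pool-average RTT with 100% success scores exactly 1.0, and halving its RTT doubles its score, which keeps scores relative to the pool as the bullet requires.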
- Weighted random selection — Resolvers are picked with probability proportional to their score. A resolver with score 0.9 gets ~3x more traffic than one with score 0.3, but the low-score resolver still gets some traffic.
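A sketch of the proportional pick using only the standard library (the function name is hypothetical):

```python
import random

def pick_resolver(scores, rng=random):
    """Pick one resolver name with probability proportional to its score."""
    names = list(scores)
    weights = [max(scores[name], 0.0) for name in names]
    if sum(weights) == 0:
        return rng.choice(names)  # no signal yet: fall back to uniform
    return rng.choices(names, weights=weights, k=1)[0]
```

With scores 0.9 and 0.3, the first resolver should receive about 75% of the picks (0.9 / 1.2), i.e. 3x the second one's share, while the weaker resolver keeps getting the rest.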
- Minimum exploration traffic — Every resolver receives at least a configurable minimum percentage of baseline traffic (e.g., 5-10%) regardless of score. This ensures:
- Low-score resolvers can prove recovery
- Newly added/reactivated resolvers get a fair evaluation period
- Score staleness is prevented
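One way to enforce the floor (hypothetical helper; the 0.05 default mirrors the suggested ADAPTIVE_EXPLORATION_PERCENT): reserve floor * n of the probability mass for a uniform share and distribute the rest by score.

```python
def apply_exploration_floor(weights, floor=0.05):
    """Turn raw score weights into selection probabilities where every
    resolver gets at least `floor` of the traffic."""
    n = len(weights)
    if floor * n > 1.0:
        raise ValueError("floor too large for this pool size")
    total = sum(weights) or 1.0  # avoid division by zero on an all-zero pool
    remaining = 1.0 - floor * n  # mass left after paying every floor share
    return [floor + remaining * (w / total) for w in weights]
```

Even a resolver with score 0 keeps its floor share, so it can prove recovery and its score never goes stale.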
- Global congestion detection — If a large percentage of resolvers (e.g., >70%) degrade simultaneously (which happens during heavy uploads/downloads), score updates are dampened or frozen. This prevents a big file transfer from unfairly tanking all resolver scores — the congestion is on our side, not the resolvers'.
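The check itself can be tiny (sketch; the 0.7 default mirrors the suggested ADAPTIVE_CONGESTION_THRESHOLD):

```python
def congestion_frozen(degraded_count, pool_size, threshold=0.7):
    """True when so many resolvers degraded at once that the bottleneck is
    almost certainly local; callers then skip score updates for this tick."""
    return pool_size > 0 and degraded_count / pool_size > threshold
```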
- Outlier dampening — A short burst of 3-4 timeouts shouldn't permanently destroy a resolver's score. A minimum observation count is required before a score is considered statistically significant; single-event spikes are smoothed.
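A sketch of that dampening, blending a thin sample toward the pool average until the suggested ADAPTIVE_MIN_OBSERVATIONS is reached (the function name and the linear blending scheme are assumptions):

```python
def trusted_score(raw_score, n_observations, pool_avg_score, min_observations=10):
    """Below the minimum sample count, weight the raw score by how many
    samples back it and fill the rest with the pool average, so 3-4
    timeouts cannot tank a resolver on their own."""
    if n_observations >= min_observations:
        return raw_score
    w = n_observations / min_observations
    return w * raw_score + (1 - w) * pool_avg_score
```

A resolver with 5 of 10 required samples and a raw score of 0 still lands at half the pool average instead of the bottom.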
- Neutral start for new/reactivated resolvers — Resolvers with no recent data start at the pool's average score, not the bottom. They earn their final position through actual performance.
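The neutral start is then just (sketch; 1.0 for an empty pool is an assumed default):

```python
def initial_score(pool_scores):
    """Score assigned to a resolver with no recent observations:
    the current pool average, or 1.0 when the pool is empty."""
    if not pool_scores:
        return 1.0
    return sum(pool_scores) / len(pool_scores)
```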
Expected behavior after 10 minutes with 500 resolvers
- Top ~100 resolvers (stable, fast) handle ~70-80% of traffic
- Middle ~200 resolvers (decent, occasional issues) handle ~15-20% of traffic
- Bottom ~200 resolvers (flaky, slow) handle ~5-10% of traffic (exploration)
- Auto-disable becomes a last resort for truly dead resolvers, not the primary mechanism
- Log spam from disable/reactivate churn is dramatically reduced
Suggested configuration
RESOLVER_BALANCING_STRATEGY = "adaptive" # new mode (existing: round-robin, random)
ADAPTIVE_SCORE_WINDOW_SECONDS = 300.0 # sliding window for scoring (5 min)
ADAPTIVE_MIN_OBSERVATIONS = 10 # minimum samples before score is trusted
ADAPTIVE_EXPLORATION_PERCENT = 5 # minimum traffic % for low-score resolvers
ADAPTIVE_CONGESTION_THRESHOLD = 0.7 # freeze scores if >70% resolvers degrade at once
ADAPTIVE_SCORE_DECAY_FACTOR = 0.95 # exponential decay for older observations within window
Compatibility
- Existing round-robin and random strategies remain unchanged as the default; adaptive is opt-in
- The existing auto-disable/recheck system remains as a safety net underneath adaptive scoring
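The in-window decay controlled by ADAPTIVE_SCORE_DECAY_FACTOR could be applied like this (sketch; assumes observations are iterated newest-first):

```python
def decayed_success_rate(observations, decay=0.95):
    """Exponentially weighted success rate; observations[0] is the most
    recent, and each step back in time multiplies the weight by `decay`."""
    weight = 1.0
    num = den = 0.0
    for ok in observations:
        num += weight * (1.0 if ok else 0.0)
        den += weight
        weight *= decay
    return num / den if den else 0.0
```

This makes the window's edge soft: an old failure still counts, just less than a fresh one.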