Summary
The current tier system bundles two distinct capabilities: (1) hardware-aware model recommendation via the scoring engine, and (2) rigid assignment of models to 3 fixed tiers (fast/standard/longctx). Research into how agent frameworks handle model selection reveals that capability (1) is uniquely valuable and should be preserved, while capability (2) creates friction with downstream consumers and should be reworked into flexible named endpoints.
Research Findings
How Agent Frameworks Handle Model Selection
| Framework | Approach | Multi-Model? | Who Decides? |
| --- | --- | --- | --- |
| Hermes Agent | Single primary + per-task "auxiliary" models (vision, compression, web_extract, delegation). Fallback model config. | Yes, per-task | User configures |
| CrewAI | Per-agent LLM assignment. Each agent in a crew gets a different LLM. | Yes, per-agent | User assigns |
| LangGraph | Per-node model assignment. Different LLMs at different graph nodes. | | |
Key finding 1: Hermes Agent issue #157 is an open feature request for capability-category routing (fast, reasoning, uncensored, cheap) — a superset of our fixed 3 tiers. Agent frameworks are planning to own this routing layer themselves.
Key finding 2: LiteLLM already has a complexity router that scores requests across 7 dimensions and routes to LIGHT/STANDARD/REASONING tiers. We use LiteLLM as our proxy — this routing capability is available for free without building our own.
Key finding 3: Spacebot (Spacedrive's agent) designed a prompt complexity scorer but never implemented it — their actual routing is process-type based. They learned that letting different process types pick their own model is more useful than rigid complexity scoring.
Is "Which Model to Use" a Real User Pain Point?
Yes, strongly validated. A Reddit thread asking "what's most frustrating about local AI?" got 48 upvotes and 114 comments. Another thread titled "How do you pick the right local LLM?" opens with "there are so many options that I don't even know where to start."
But this pain has two distinct layers:
Layer 1: "Which models should I download?" — model selection for your hardware (fits in memory, fast enough, good quality)
Layer 2: "Which model should handle this specific request?" — request-level routing
Layer 1 is where mlx-stack's scoring engine shines. Layer 2 is what agent frameworks want to own.
The Tension
What the scoring engine does well (keep)
scoring.py evaluates models against specific hardware using intent-weighted composite scoring across speed, quality, tool_calling, and memory_efficiency. It handles bandwidth-ratio estimation for unknown hardware, memory budget enforcement, and benchmark-gated scoring. No agent framework does this. This is mlx-stack's unique differentiator.
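As a rough illustration, intent-weighted composite scoring can be sketched as below. The weights, intent names, and field names here are invented for the example, not mlx-stack's actual values:

```python
# Hypothetical sketch of intent-weighted composite scoring. The real scoring.py
# weights, intent names, and field names may differ; these values are invented.
INTENT_WEIGHTS = {
    "chat":   {"speed": 0.4, "quality": 0.3, "tool_calling": 0.1, "memory_efficiency": 0.2},
    "coding": {"speed": 0.2, "quality": 0.4, "tool_calling": 0.3, "memory_efficiency": 0.1},
}

def composite_score(model_scores: dict, intent: str) -> float:
    """Weighted sum of per-dimension scores (each assumed to be in 0..1)."""
    weights = INTENT_WEIGHTS[intent]
    return sum(weights[dim] * model_scores[dim] for dim in weights)

# A fast but mid-quality model, scored for the "chat" intent:
model = {"speed": 0.9, "quality": 0.6, "tool_calling": 0.5, "memory_efficiency": 0.8}
print(round(composite_score(model, "chat"), 2))  # → 0.75
```

Changing the intent shifts the weights, so the same model ranks differently for chat versus coding workloads.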
What the tier system does poorly (rework)
assign_tiers() forces recommended models into exactly 3 named slots with fixed semantics:
standard = highest composite score
fast = highest gen_tps (different from standard)
longctx = architecturally diverse (mamba2-hybrid)
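The fixed-slot behavior can be sketched as follows (a simplified, hypothetical rendering of `assign_tiers()`, not the real implementation; field names are illustrative):

```python
# Simplified sketch of the rigid 3-slot behavior described above -- not the
# actual assign_tiers() implementation; field names are illustrative.
def assign_tiers(ranked):
    """ranked: models sorted by composite score, best first."""
    tiers = {"standard": ranked[0]}                        # highest composite score
    tiers["fast"] = max(                                   # fastest non-standard model
        (m for m in ranked if m is not tiers["standard"]),
        key=lambda m: m["gen_tps"], default=None)
    tiers["longctx"] = next(                               # architecturally diverse pick
        (m for m in ranked if m.get("arch") == "mamba2-hybrid"), None)
    return tiers  # every consumer gets exactly these 3 fixed slots

models = [
    {"name": "big-q4",   "gen_tps": 35, "arch": "transformer"},
    {"name": "small-q8", "gen_tps": 90, "arch": "transformer"},
    {"name": "hyb-7b",   "gen_tps": 60, "arch": "mamba2-hybrid"},
]
tiers = assign_tiers(models)
print(tiers["standard"]["name"], tiers["fast"]["name"], tiers["longctx"]["name"])
# → big-q4 small-q8 hyb-7b
```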
This creates friction:
Hermes wants auxiliary models for vision, compression, web_extract, delegation — categories that don't map to fast/standard/longctx
CrewAI wants per-agent assignment where a "researcher" gets one model and a "coder" gets another — they don't think in tiers
Hermes #157 proposes user-defined categories (fast, reasoning, uncensored, cheap) — our fixed 3 tiers are a subset
LiteLLM complexity routing already handles "route simple requests to small model" — we don't need to duplicate this
The fundamental mismatch: mlx-stack assigns semantic meaning to tiers ("fast" means "for simple tasks"), but agent frameworks want to assign their own semantic meaning to models based on their internal context.
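For intuition, a keyword-based complexity scorer of the general shape described above might look like this simplified sketch. The dimensions, keywords, and thresholds are invented here; this is not LiteLLM's actual implementation or API:

```python
# Hypothetical keyword-based complexity scorer, loosely in the style of the
# multi-dimension routers discussed above; dimensions and weights are made up.
DIMENSIONS = {
    "reasoning": ["prove", "derive", "step by step", "why"],
    "code":      ["function", "class", "refactor", "traceback"],
}

def complexity(prompt: str) -> str:
    text = prompt.lower()
    score = sum(1 for kws in DIMENSIONS.values() for kw in kws if kw in text)
    score += len(text) // 500  # long prompts add weight
    if score == 0:
        return "LIGHT"
    return "STANDARD" if score <= 2 else "REASONING"

print(complexity("hi there"))                                             # → LIGHT
print(complexity("refactor this function and explain why, step by step"))  # → REASONING
```

The point of the sketch is the shape, not the details: request-level routing like this is a generic capability that the proxy layer can own, so mlx-stack does not need to build it.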
Recommendation: Separate the Two Concerns
Keep: Hardware-Aware Model Recommendation
The scoring engine (score_and_filter(), score_model(), the intent weights, benchmark resolution) is the crown jewel. No changes needed.
Rework: Tier Assignment → Flexible Named Endpoints
Instead of assign_tiers() forcing models into fast/standard/longctx:
recommend outputs a ranked list with scores, not forced tier assignments. "Here are 5 models that fit your hardware, sorted by composite score, with speed/quality/tool_calling scores visible." The user sees the data and decides.
Users (or config generators) assign models to named endpoints. Instead of forcing fast/standard/longctx, allow arbitrary endpoint names:
mlx-stack init defaults to primary + secondary (or keeps fast/standard as sensible defaults)
mlx-stack init --harness hermes creates endpoints mapped to Hermes's config structure
mlx-stack init --harness crewai creates endpoints for CrewAI's per-agent assignment pattern
Expose LiteLLM's complexity routing as opt-in rather than building a custom router. Users who want "auto-route simple requests to small model" enable LiteLLM's built-in complexity router.
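Under this proposal, a stack.yaml with user-named endpoints might look like the following sketch. The schema and model names are illustrative assumptions, not the current format:

```yaml
# Hypothetical stack.yaml fragment (illustrative schema, not the current format).
# Endpoint names are arbitrary labels chosen by the user or a --harness generator.
endpoints:
  primary:
    model: mlx-community/some-32b-instruct-4bit   # highest composite score
  researcher:
    model: mlx-community/some-8b-instruct-4bit    # CrewAI-style per-agent role
  vision:
    model: mlx-community/some-vl-7b-4bit          # Hermes-style auxiliary task
```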
Phased Delivery
Phase 1 (v0.2): Keep fast/standard/longctx as the default naming convention for backward compatibility, but make the names configurable in stack.yaml. Add --harness flag to init that generates framework-specific endpoint mappings.
Phase 2 (v0.3): Refactor assign_tiers() to support arbitrary slot definitions. Make recommend output a ranked list with scores instead of forced tier assignments. Add LiteLLM complexity routing documentation.
Phase 3 (future): If demand materializes, add an optional request-level routing layer that uses LiteLLM's auto-routing with mlx-stack's model scores as inputs.
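The ranked-list output proposed for recommend in Phase 2 could be shaped roughly like this (a hypothetical data shape, not the actual CLI output; model names and scores are invented):

```python
# Hypothetical shape for the proposed `recommend` output: a ranked list with
# visible per-dimension scores instead of forced tier assignments.
recommendation = [
    {"model": "some-32b-4bit", "composite": 0.81, "speed": 0.55, "quality": 0.92, "tool_calling": 0.88},
    {"model": "some-8b-4bit",  "composite": 0.74, "speed": 0.95, "quality": 0.61, "tool_calling": 0.70},
]

# Consumers (users, config generators, agent harnesses) pick by their own criteria:
fastest = max(recommendation, key=lambda m: m["speed"])
print(fastest["model"])  # → some-8b-4bit
```

The list stays sorted by composite score, but nothing forces a consumer to use that ordering — a harness config generator can select by speed, tool_calling, or any other visible dimension.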
The scoring engine remains the recommendation source for recommend, add-model (arbitrary model support: mlx-stack add-model for models outside the catalog, #27), and the harness config generators (agent harness integration guides and config generators, #26).
What This Changes
Affected components: scoring.py, assign_tiers(), mlx-stack recommend, mlx-stack init (the --harness flag maps to framework-specific roles), and litellm_gen.py.
Impact on Other Issues
--harness hermes can map recommended models to Hermes-specific roles instead of generic tiers
#27 (arbitrary model support): Fully compatible — user-added models join the scored catalog and can be assigned to any named endpoint
Prior Art
LiteLLM complexity router — 7-dimension keyword scoring, configurable weights and tier boundaries, already available to mlx-stack
Hermes #157 — Community-proposed capability-category routing with evaluation feedback loop
Spacebot prompt-routing design doc — 7-dimension scorer with LIGHT/STANDARD/HEAVY tiers (designed but not implemented; actual routing is process-type based)
Kalibr — Telemetry-driven automatic model routing based on production success/failure data (no upfront categories)
IBM LLM Router research — Analyzes incoming queries and routes to the most cost-effective model in real time