
Rework tier system: decouple hardware-aware recommendation from rigid tier assignment #39

@weklund

Description


Summary

The current tier system bundles two distinct capabilities: (1) hardware-aware model recommendation via the scoring engine, and (2) rigid assignment of models to 3 fixed tiers (fast/standard/longctx). Research into how agent frameworks handle model selection reveals that capability (1) is uniquely valuable and should be preserved, while capability (2) creates friction with downstream consumers and should be reworked into flexible named endpoints.

Research Findings

How Agent Frameworks Handle Model Selection

| Framework | Approach | Multi-Model? | Who Decides? |
| --- | --- | --- | --- |
| Hermes Agent | Single primary + per-task "auxiliary" models (vision, compression, web_extract, delegation). Fallback model config. | Yes, per-task | User configures |
| CrewAI | Per-agent LLM assignment. Each agent in a crew gets a different LLM. | Yes, per-agent | User assigns |
| LangGraph | Per-node model assignment. Different LLMs at different graph nodes. | Yes, per-node | Developer hardcodes |
| OpenHands | Single model. Points at one API endpoint. | No | User picks one |
| LiteLLM (our dependency) | Built-in complexity router with 7-dimension keyword scoring (LIGHT/STANDARD/REASONING) + semantic auto-routing | Yes, automatic | Rules-based |

Key finding 1: Hermes Agent issue #157 is an open feature request for capability-category routing (fast, reasoning, uncensored, cheap), a broader, user-extensible version of our fixed 3 tiers. Agent frameworks are planning to own this routing layer themselves.

Key finding 2: LiteLLM already has a complexity router that scores requests across 7 dimensions and routes to LIGHT/STANDARD/REASONING tiers. We use LiteLLM as our proxy — this routing capability is available for free without building our own.

Key finding 3: Spacebot (Spacedrive's agent) designed a prompt complexity scorer but never implemented it — their actual routing is process-type based. They learned that letting different process types pick their own model is more useful than rigid complexity scoring.

Is "Which Model to Use" a Real User Pain Point?

Yes, strongly validated. A Reddit thread asking "what's most frustrating about local AI?" got 48 upvotes and 114 comments. Another thread titled "How do you pick the right local LLM?" opens with "there are so many options that I don't even know where to start."

But this pain has two distinct layers:

  1. "Which models should I download?" — model selection for your hardware (fits in memory, fast enough, good quality)
  2. "Which model should handle this specific request?" — request-level routing

Layer 1 is where mlx-stack's scoring engine shines. Layer 2 is what agent frameworks want to own.

The Tension

What the scoring engine does well (keep)

scoring.py evaluates models against specific hardware using intent-weighted composite scoring across speed, quality, tool_calling, and memory_efficiency. It handles bandwidth-ratio estimation for unknown hardware, memory budget enforcement, and benchmark-gated scoring. No agent framework does this. This is mlx-stack's unique differentiator.
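
To make the mechanism concrete, here is a minimal sketch of the shape of that scoring. The function names mirror scoring.py, but the weights, fields, and bodies below are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass

# Illustrative intent weights -- NOT the real tables in scoring.py.
INTENT_WEIGHTS = {
    "coding": {"speed": 0.2, "quality": 0.4, "tool_calling": 0.3, "memory_efficiency": 0.1},
    "chat":   {"speed": 0.4, "quality": 0.4, "tool_calling": 0.1, "memory_efficiency": 0.1},
}

@dataclass
class ModelProfile:
    name: str
    est_memory_gb: float  # estimated resident memory on this hardware
    scores: dict          # per-dimension scores in [0, 1], benchmark-gated upstream

def score_model(model: ModelProfile, intent: str) -> float:
    """Intent-weighted composite score across the four dimensions."""
    weights = INTENT_WEIGHTS[intent]
    return sum(w * model.scores.get(dim, 0.0) for dim, w in weights.items())

def score_and_filter(models, intent, memory_budget_gb):
    """Drop models that exceed the memory budget, then rank by composite score."""
    fits = [m for m in models if m.est_memory_gb <= memory_budget_gb]
    return sorted(fits, key=lambda m: score_model(m, intent), reverse=True)
```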

What the tier system does poorly (rework)

assign_tiers() forces recommended models into exactly 3 named slots with fixed semantics (see the sketch after this list):

  • standard = highest composite score
  • fast = highest gen_tps (different from standard)
  • longctx = architecturally diverse (mamba2-hybrid)
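
For contrast, the rigid behavior boils down to something like this. It is a paraphrase of the semantics listed above, not the actual assign_tiers() body, and the gen_tps/architecture field names are hypothetical:

```python
def assign_tiers(ranked):
    """Force a ranked candidate list into exactly three fixed slots.

    Assumes `ranked` is sorted by composite score (best first) and each
    entry has `gen_tps` and `architecture` fields (hypothetical names).
    """
    standard = ranked[0]  # highest composite score
    rest = [m for m in ranked if m is not standard]
    fast = max(rest, key=lambda m: m.gen_tps) if rest else standard  # highest throughput
    longctx = next((m for m in ranked if m.architecture == "mamba2-hybrid"), standard)
    return {"standard": standard, "fast": fast, "longctx": longctx}
```

Every consumer gets these three keys whether or not they mean anything in its world.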

This creates friction:

  • Hermes wants auxiliary models for vision, compression, web_extract, delegation — categories that don't map to fast/standard/longctx
  • CrewAI wants per-agent assignment where a "researcher" gets one model and a "coder" gets another — they don't think in tiers
  • Hermes #157 proposes user-defined categories (fast, reasoning, uncensored, cheap); our fixed 3 tiers are just one hardcoded instance of that pattern
  • LiteLLM complexity routing already handles "route simple requests to small model" — we don't need to duplicate this

The fundamental mismatch: mlx-stack assigns semantic meaning to tiers ("fast" means "for simple tasks"), but agent frameworks want to assign their own semantic meaning to models based on their internal context.

Recommendation: Separate the Two Concerns

Keep: Hardware-Aware Model Recommendation

The scoring engine (score_and_filter(), score_model(), the intent weights, benchmark resolution) is the crown jewel. No changes needed.

Rework: Tier Assignment → Flexible Named Endpoints

Instead of assign_tiers() forcing models into fast/standard/longctx:

  1. recommend outputs a ranked list with scores, not forced tier assignments. "Here are 5 models that fit your hardware, sorted by composite score, with speed/quality/tool_calling scores visible." The user sees the data and decides.

  2. Users (or config generators) assign models to named endpoints. Instead of forcing fast/standard/longctx, allow arbitrary endpoint names (see the sketch after this list):

    • mlx-stack init defaults to primary + secondary (or keeps fast/standard as sensible defaults)
    • mlx-stack init --harness hermes creates endpoints mapped to Hermes's config structure
    • mlx-stack init --harness crewai creates endpoints for CrewAI's per-agent assignment pattern
  3. Expose LiteLLM's complexity routing as opt-in rather than building a custom router. Users who want "auto-route simple requests to small model" enable LiteLLM's built-in complexity router.

  4. The scoring engine remains the recommendation source for recommend, add-model (Arbitrary model support: mlx-stack add-model for models outside the catalog #27), and harness config generators (Agent harness integration guides and config generators #26).
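
As a sketch of point 2, here is what a flexible litellm_gen.py could emit. The model_list shape follows LiteLLM's standard proxy config; the endpoint names, model IDs, and port are made-up examples:

```python
import yaml  # pip install pyyaml

# Hypothetical endpoints a user (or `mlx-stack init --harness hermes`)
# might declare: arbitrary names instead of fixed tiers.
ENDPOINTS = {
    "primary":     "mlx-community/Qwen2.5-32B-Instruct-4bit",
    "vision":      "mlx-community/Qwen2-VL-7B-Instruct-4bit",
    "compression": "mlx-community/Llama-3.2-3B-Instruct-4bit",
}

def litellm_config(endpoints: dict, api_base: str = "http://localhost:8081/v1") -> dict:
    """Each user-named endpoint becomes a LiteLLM model group."""
    return {
        "model_list": [
            {
                "model_name": name,
                "litellm_params": {"model": f"openai/{model}", "api_base": api_base},
            }
            for name, model in endpoints.items()
        ]
    }

print(yaml.safe_dump(litellm_config(ENDPOINTS), sort_keys=False))
```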

What This Changes

| Component | Before | After |
| --- | --- | --- |
| scoring.py assign_tiers() | Fixed 3 slots | Flexible N slots with configurable semantics |
| mlx-stack recommend | Shows fast/standard/longctx | Shows ranked list; user picks roles |
| mlx-stack init | Creates fast/standard/longctx | Defaults to sensible named endpoints; --harness flag maps to framework-specific roles |
| Request-level routing | Not implemented | Don't build; document how to enable LiteLLM complexity routing |
| litellm_gen.py | Generates config with fixed tier model groups | Generates config with user-defined endpoint names |

Impact on Other Issues

Phased Delivery

Phase 1 (v0.2): Keep fast/standard/longctx as the default naming convention for backward compatibility, but make the names configurable in stack.yaml. Add --harness flag to init that generates framework-specific endpoint mappings.
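
A possible shape for the configurable naming, with hypothetical stack.yaml keys (all names and model IDs below are illustrative):

```python
import yaml

# Hypothetical stack.yaml fragment: default names survive for
# back-compat, but users can add or rename endpoints freely.
STACK_YAML = """
endpoints:
  fast: mlx-community/Llama-3.2-3B-Instruct-4bit       # legacy default name
  standard: mlx-community/Qwen2.5-32B-Instruct-4bit    # legacy default name
  researcher: mlx-community/Qwen2.5-72B-Instruct-4bit  # user-defined role
"""

LEGACY_NAMES = {"fast", "standard", "longctx"}

for name, model in yaml.safe_load(STACK_YAML)["endpoints"].items():
    kind = "default" if name in LEGACY_NAMES else "custom"
    print(f"{kind:7} endpoint {name!r} -> {model}")
```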

Phase 2 (v0.3): Refactor assign_tiers() to support arbitrary slot definitions. Make recommend output a ranked list with scores instead of forced tier assignments. Add LiteLLM complexity routing documentation.
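
One way the refactor could look: slot definitions become data passed in by the caller rather than semantics hardcoded in the function (names below are hypothetical):

```python
# Hypothetical Phase 2 shape: callers define their own slots.
DEFAULT_SLOTS = {
    "primary": lambda ranked: ranked[0],                         # best composite score
    "fast":    lambda ranked: max(ranked, key=lambda m: m.gen_tps),
}

def assign_slots(ranked, slot_specs=DEFAULT_SLOTS):
    """Fill arbitrary named slots from a ranked candidate list."""
    return {name: pick(ranked) for name, pick in slot_specs.items()}

# A CrewAI-style caller could pass {"researcher": ..., "coder": ...} instead.
```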

Phase 3 (future): If demand materializes, add an optional request-level routing layer that uses LiteLLM's auto-routing with mlx-stack's model scores as inputs.

Prior Art

  • LiteLLM complexity router — 7-dimension keyword scoring, configurable weights and tier boundaries, already available to mlx-stack
  • Hermes #157 — Community-proposed capability-category routing with evaluation feedback loop
  • Spacebot prompt-routing design doc — 7-dimension scorer with LIGHT/STANDARD/HEAVY tiers (designed but not implemented; actual routing is process-type based)
  • Kalibr — Telemetry-driven automatic model routing based on production success/failure data (no upfront categories)
  • IBM LLM Router research — Analyzes incoming queries and routes to most cost-effective model in real time
