
Rework tier system: decouple hardware-aware recommendation from rigid tier assignment #39

@weklund

Description


Summary

The current tier system bundles two distinct capabilities: (1) hardware-aware model recommendation via the scoring engine, and (2) rigid assignment of models to 3 fixed tiers (fast/standard/longctx). Research into how agent frameworks handle model selection reveals that capability (1) is uniquely valuable and should be preserved, while capability (2) creates friction with downstream consumers and should be reworked into flexible named endpoints.

Research Findings

How Agent Frameworks Handle Model Selection

| Framework | Approach | Multi-Model? | Who Decides? |
| --- | --- | --- | --- |
| Hermes Agent | Single primary + per-task "auxiliary" models (vision, compression, web_extract, delegation). Fallback model config. | Yes, per-task | User configures |
| CrewAI | Per-agent LLM assignment. Each agent in a crew gets a different LLM. | Yes, per-agent | User assigns |
| LangGraph | Per-node model assignment. Different LLMs at different graph nodes. | Yes, per-node | Developer hardcodes |
| OpenHands | Single model. Points at one API endpoint. | No | User picks one |
| LiteLLM (our dependency) | Built-in complexity router with 7-dimension keyword scoring (LIGHT/STANDARD/REASONING) + semantic auto-routing | Yes, automatic | Rules-based |

Key finding 1: Hermes Agent issue #157 is an open feature request for capability-category routing (fast, reasoning, uncensored, cheap), a broader, user-extensible version of our fixed 3 tiers. Agent frameworks are planning to own this routing layer themselves.

Key finding 2: LiteLLM already has a complexity router that scores requests across 7 dimensions and routes to LIGHT/STANDARD/REASONING tiers. We use LiteLLM as our proxy — this routing capability is available for free without building our own.

Key finding 3: Spacebot (Spacedrive's agent) designed a prompt complexity scorer but never implemented it — their actual routing is process-type based. They learned that letting different process types pick their own model is more useful than rigid complexity scoring.

Is "Which Model to Use" a Real User Pain Point?

Yes, strongly validated. A Reddit thread asking "what's most frustrating about local AI?" got 48 upvotes and 114 comments. Another thread titled "How do you pick the right local LLM?" opens with "there are so many options that I don't even know where to start."

But this pain has two distinct layers:

  1. "Which models should I download?" — model selection for your hardware (fits in memory, fast enough, good quality)
  2. "Which model should handle this specific request?" — request-level routing

Layer 1 is where mlx-stack's scoring engine shines. Layer 2 is what agent frameworks want to own.

The Tension

What the scoring engine does well (keep)

scoring.py evaluates models against specific hardware using intent-weighted composite scoring across speed, quality, tool_calling, and memory_efficiency. It handles bandwidth-ratio estimation for unknown hardware, memory budget enforcement, and benchmark-gated scoring. No agent framework does this. This is mlx-stack's unique differentiator.
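
To make the mechanism concrete, here is a minimal sketch of the shape of that scoring. The function names mirror scoring.py, but the weights, fields, and bodies below are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass

# Illustrative intent weights -- NOT the real tables in scoring.py.
INTENT_WEIGHTS = {
    "coding": {"speed": 0.2, "quality": 0.4, "tool_calling": 0.3, "memory_efficiency": 0.1},
    "chat":   {"speed": 0.4, "quality": 0.4, "tool_calling": 0.1, "memory_efficiency": 0.1},
}

@dataclass
class ModelProfile:
    name: str
    est_memory_gb: float  # estimated resident memory on this hardware
    scores: dict          # per-dimension scores in [0, 1], benchmark-gated upstream

def score_model(model: ModelProfile, intent: str) -> float:
    """Intent-weighted composite score across the four dimensions."""
    weights = INTENT_WEIGHTS[intent]
    return sum(w * model.scores.get(dim, 0.0) for dim, w in weights.items())

def score_and_filter(models, intent, memory_budget_gb):
    """Drop models that exceed the memory budget, then rank by composite score."""
    fits = [m for m in models if m.est_memory_gb <= memory_budget_gb]
    return sorted(fits, key=lambda m: score_model(m, intent), reverse=True)
```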

What the tier system does poorly (rework)

assign_tiers() forces recommended models into exactly 3 named slots with fixed semantics (see the sketch after this list):

  • standard = highest composite score
  • fast = highest gen_tps (different from standard)
  • longctx = architecturally diverse (mamba2-hybrid)
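
For contrast, the rigid behavior boils down to something like this. It is a paraphrase of the semantics listed above, not the actual assign_tiers() body, and the gen_tps/architecture field names are hypothetical:

```python
def assign_tiers(ranked):
    """Force a ranked candidate list into exactly three fixed slots.

    Assumes `ranked` is sorted by composite score (best first) and each
    entry has `gen_tps` and `architecture` fields (hypothetical names).
    """
    standard = ranked[0]  # highest composite score
    rest = [m for m in ranked if m is not standard]
    fast = max(rest, key=lambda m: m.gen_tps) if rest else standard  # highest throughput
    longctx = next((m for m in ranked if m.architecture == "mamba2-hybrid"), standard)
    return {"standard": standard, "fast": fast, "longctx": longctx}
```

Every consumer gets these three keys whether or not they mean anything in its world.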

This creates friction:

  • Hermes wants auxiliary models for vision, compression, web_extract, delegation — categories that don't map to fast/standard/longctx
  • CrewAI wants per-agent assignment where a "researcher" gets one model and a "coder" gets another — they don't think in tiers
  • Hermes #157 proposes user-defined categories (fast, reasoning, uncensored, cheap); our fixed 3 tiers are just one hardcoded instance of that pattern
  • LiteLLM complexity routing already handles "route simple requests to small model" — we don't need to duplicate this

The fundamental mismatch: mlx-stack assigns semantic meaning to tiers ("fast" means "for simple tasks"), but agent frameworks want to assign their own semantic meaning to models based on their internal context.

Recommendation: Separate the Two Concerns

Keep: Hardware-Aware Model Recommendation

The scoring engine (score_and_filter(), score_model(), the intent weights, benchmark resolution) is the crown jewel. No changes needed.

Rework: Tier Assignment → Flexible Named Endpoints

Instead of assign_tiers() forcing models into fast/standard/longctx:

  1. recommend outputs a ranked list with scores, not forced tier assignments. "Here are 5 models that fit your hardware, sorted by composite score, with speed/quality/tool_calling scores visible." The user sees the data and decides.

  2. Users (or config generators) assign models to named endpoints. Instead of forcing fast/standard/longctx, allow arbitrary endpoint names (see the sketch after this list):

    • mlx-stack init defaults to primary + secondary (or keeps fast/standard as sensible defaults)
    • mlx-stack init --harness hermes creates endpoints mapped to Hermes's config structure
    • mlx-stack init --harness crewai creates endpoints for CrewAI's per-agent assignment pattern
  3. Expose LiteLLM's complexity routing as opt-in rather than building a custom router. Users who want "auto-route simple requests to small model" enable LiteLLM's built-in complexity router.

  4. The scoring engine remains the recommendation source for recommend, add-model (Arbitrary model support: mlx-stack add-model for models outside the catalog #27), and harness config generators (Agent harness integration guides and config generators #26).
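
As a sketch of point 2, here is what a flexible litellm_gen.py could emit. The model_list shape follows LiteLLM's standard proxy config; the endpoint names, model IDs, and port are made-up examples:

```python
import yaml  # pip install pyyaml

# Hypothetical endpoints a user (or `mlx-stack init --harness hermes`)
# might declare: arbitrary names instead of fixed tiers.
ENDPOINTS = {
    "primary":     "mlx-community/Qwen2.5-32B-Instruct-4bit",
    "vision":      "mlx-community/Qwen2-VL-7B-Instruct-4bit",
    "compression": "mlx-community/Llama-3.2-3B-Instruct-4bit",
}

def litellm_config(endpoints: dict, api_base: str = "http://localhost:8081/v1") -> dict:
    """Each user-named endpoint becomes a LiteLLM model group."""
    return {
        "model_list": [
            {
                "model_name": name,
                "litellm_params": {"model": f"openai/{model}", "api_base": api_base},
            }
            for name, model in endpoints.items()
        ]
    }

print(yaml.safe_dump(litellm_config(ENDPOINTS), sort_keys=False))
```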

What This Changes

| Component | Before | After |
| --- | --- | --- |
| scoring.py assign_tiers() | Fixed 3 slots | Flexible N slots with configurable semantics |
| mlx-stack recommend | Shows fast/standard/longctx | Shows ranked list; user picks roles |
| mlx-stack init | Creates fast/standard/longctx | Defaults to sensible named endpoints; --harness flag maps to framework-specific roles |
| Request-level routing | Not implemented | Don't build; document how to enable LiteLLM complexity routing |
| litellm_gen.py | Generates config with fixed tier model groups | Generates config with user-defined endpoint names |

Impact on Other Issues

Phased Delivery

Phase 1 (v0.2): Keep fast/standard/longctx as the default naming convention for backward compatibility, but make the names configurable in stack.yaml. Add --harness flag to init that generates framework-specific endpoint mappings.
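
A possible shape for the configurable naming, with hypothetical stack.yaml keys (all names and model IDs below are illustrative):

```python
import yaml

# Hypothetical stack.yaml fragment: default names survive for
# back-compat, but users can add or rename endpoints freely.
STACK_YAML = """
endpoints:
  fast: mlx-community/Llama-3.2-3B-Instruct-4bit       # legacy default name
  standard: mlx-community/Qwen2.5-32B-Instruct-4bit    # legacy default name
  researcher: mlx-community/Qwen2.5-72B-Instruct-4bit  # user-defined role
"""

LEGACY_NAMES = {"fast", "standard", "longctx"}

for name, model in yaml.safe_load(STACK_YAML)["endpoints"].items():
    kind = "default" if name in LEGACY_NAMES else "custom"
    print(f"{kind:7} endpoint {name!r} -> {model}")
```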

Phase 2 (v0.3): Refactor assign_tiers() to support arbitrary slot definitions. Make recommend output a ranked list with scores instead of forced tier assignments. Add LiteLLM complexity routing documentation.
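
One way the refactor could look: slot definitions become data passed in by the caller rather than semantics hardcoded in the function (names below are hypothetical):

```python
# Hypothetical Phase 2 shape: callers define their own slots.
DEFAULT_SLOTS = {
    "primary": lambda ranked: ranked[0],                         # best composite score
    "fast":    lambda ranked: max(ranked, key=lambda m: m.gen_tps),
}

def assign_slots(ranked, slot_specs=DEFAULT_SLOTS):
    """Fill arbitrary named slots from a ranked candidate list."""
    return {name: pick(ranked) for name, pick in slot_specs.items()}

# A CrewAI-style caller could pass {"researcher": ..., "coder": ...} instead.
```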

Phase 3 (future): If demand materializes, add an optional request-level routing layer that uses LiteLLM's auto-routing with mlx-stack's model scores as inputs.

Prior Art

  • LiteLLM complexity router — 7-dimension keyword scoring, configurable weights and tier boundaries, already available to mlx-stack
  • Hermes #157 — Community-proposed capability-category routing with evaluation feedback loop
  • Spacebot prompt-routing design doc — 7-dimension scorer with LIGHT/STANDARD/HEAVY tiers (designed but not implemented; actual routing is process-type based)
  • Kalibr — Telemetry-driven automatic model routing based on production success/failure data (no upfront categories)
  • IBM LLM Router research — Analyzes incoming queries and routes to most cost-effective model in real time
