Commit e3dcf9a

weklund and claude authored
feat: 4-tier integration testing framework (#16)
## Summary

- **Adds a comprehensive integration testing framework** that proves the core user contract: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via the OpenAI client.
- **Adds catalog validation to CI** so broken HuggingFace repo URLs (like the qwen3.5-8b 404 in #15) are caught before merge.
- **Adds nightly and pre-release CI workflows** for smoke tests and full stack integration.

## Test Tiers

| Tier | Marker | What it proves | When it runs |
|------|--------|----------------|--------------|
| **1. Catalog Validation** | `catalog_validation` | All HF repos exist, fields valid, capabilities consistent | Every PR (CI) |
| **2. Model Smoke** | `smoke` | Each catalog model loads and serves inference, tool calling, thinking | Nightly |
| **3. Stack Integration** | `integration` | Full lifecycle, LiteLLM routing, concurrent requests, clean shutdown | Pre-release |
| **4. Harness Compatibility** | `harness` | OpenAI Python client works (chat, streaming, tool calling, multi-turn) | Pre-release |

## New Make Targets

```
make test-catalog       # Tier 1: fast, requires network
make test-smoke         # Tier 2: slow, requires macOS + vllm-mlx
make test-integration   # Tier 3: slow, requires macOS + vllm-mlx + litellm
make test-harness       # Tier 4: slow, requires above + openai package
```

## Key Infrastructure

- **Dynamic port allocation**: no hardcoded ports, no conflicts with running stacks
- **ServiceManager**: context manager with guaranteed cleanup (SIGTERM → SIGKILL → port verification)
- **Persistent model cache**: `~/.mlx-stack-test-cache/models/` avoids re-downloading across runs
- **Compatibility matrix**: JSON report of pass/fail per model per capability
- **Skip decorators**: graceful degradation for non-macOS, insufficient memory, missing deps

## Verification

- `make check` passes (lint + typecheck + 1,421 unit tests)
- New tests correctly deselected from default `make test` run (168 deselected)
- No changes to existing test behavior

## Test plan

- [ ] `make check` passes on CI (lint, typecheck, unit tests)
- [ ] `make test-catalog` validates catalog entries against HuggingFace API
- [ ] Nightly workflow triggers and runs smoke tests on smallest models
- [ ] Pre-release workflow triggers on release creation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
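The dynamic port allocation and the SIGTERM → SIGKILL → port verification cleanup described under Key Infrastructure could look roughly like the sketch below. This is not the commit's `ServiceManager`; `find_free_port`, `port_is_free`, and `managed_service` are hypothetical names, and the sketch assumes the service listens on `127.0.0.1`.

```python
import contextlib
import socket
import subprocess
import sys
import time


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def port_is_free(port: int) -> bool:
    """Return True if nothing accepts connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex(("127.0.0.1", port)) != 0


@contextlib.contextmanager
def managed_service(cmd, port, grace=5.0):
    """Hypothetical stand-in for ServiceManager: start, yield, always clean up."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()  # SIGTERM on POSIX
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()  # SIGKILL on POSIX
                proc.wait()
        # Port verification: wait briefly for the listener to disappear.
        deadline = time.monotonic() + grace
        while not port_is_free(port) and time.monotonic() < deadline:
            time.sleep(0.1)
        if not port_is_free(port):
            raise RuntimeError(f"port {port} still in use after shutdown")


if __name__ == "__main__":
    port = find_free_port()
    # Placeholder command; a real test would launch the server on `port`.
    child = [sys.executable, "-c", "import time; time.sleep(60)"]
    with managed_service(child, port) as proc:
        print(f"service pid={proc.pid}, reserved port={port}")
```

Binding to port 0 leaves a small window in which another process can claim the port before the service starts, so a real harness may want to retry on a bind failure.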
1 parent a4e29a8 commit e3dcf9a

34 files changed

Lines changed: 5872 additions & 62 deletions

.github/workflows/ci.yml

Lines changed: 28 additions & 0 deletions
@@ -49,3 +49,31 @@ jobs:
         with:
           name: coverage-report
           path: coverage.xml
+
+  catalog-validation:
+    name: Catalog Validation
+    runs-on: macos-latest
+    needs: ci
+    strategy:
+      matrix:
+        python-version: ["3.13"]
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: true
+
+      - name: Install Python ${{ matrix.python-version }}
+        run: uv python install ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: make install
+
+      - name: Validate catalog entries
+        run: make test-catalog
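For orientation, the kind of check the catalog validation job runs might resemble the sketch below. It is an illustration, not the commit's test code: the `hf_repo` field name and both function names are assumptions, and the only external fact relied on is the Hugging Face Hub model-info endpoint at `https://huggingface.co/api/models/{repo_id}`.

```python
import urllib.error
import urllib.request

HF_API = "https://huggingface.co/api/models/{repo_id}"


def hf_repo_status(repo_id: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of the Hub model-info endpoint for repo_id."""
    url = HF_API.format(repo_id=repo_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Missing repos surface here as 404; gated or private ones may be 401.
        return exc.code


def find_broken_repos(entries, status_of=hf_repo_status):
    """Return the repo IDs whose status is not 200.

    `entries` is a list of dicts with a hypothetical "hf_repo" key;
    `status_of` is injectable so the logic can be tested offline.
    """
    return [e["hf_repo"] for e in entries if status_of(e["hf_repo"]) != 200]
```

Injecting `status_of` keeps the comparison logic testable without network access, while CI can use the default to query the Hub.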
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+name: Nightly Integration
+
+on:
+  schedule:
+    - cron: "0 6 * * *" # 6 AM UTC daily
+  workflow_dispatch:
+    inputs:
+      model:
+        description: "Specific model ID to test (blank = all non-gated models that fit in memory)"
+        required: false
+
+concurrency:
+  group: nightly-integration
+  cancel-in-progress: true
+
+jobs:
+  smoke:
+    name: Model Smoke Tests
+    runs-on: macos-latest
+    timeout-minutes: 120
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: true
+
+      - name: Install Python 3.13
+        run: uv python install 3.13
+
+      - name: Install project dependencies
+        run: make install
+
+      - name: Install vllm-mlx
+        run: uv tool install vllm-mlx==0.2.6
+
+      - name: Run smoke tests
+        run: |
+          MODEL_FILTER="${{ github.event.inputs.model }}"
+          if [ -n "$MODEL_FILTER" ]; then
+            uv run pytest -m smoke -k "$MODEL_FILTER" -v --tb=long
+          else
+            uv run pytest -m smoke -v --tb=long
+          fi
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+
+      - name: Upload compatibility matrix
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: compatibility-matrix
+          path: ~/.mlx-stack-test-cache/reports/
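The artifact uploaded by this workflow is described in the commit message as a JSON report of pass/fail per model per capability. A minimal sketch of producing such a report follows. The function names, the output file name, and the model and capability labels are all assumptions, since the report schema is not shown in this diff.

```python
import json
from pathlib import Path


def build_matrix(results):
    """Fold (model_id, capability, passed) tuples into a nested dict."""
    matrix = {}
    for model_id, capability, passed in results:
        matrix.setdefault(model_id, {})[capability] = "pass" if passed else "fail"
    return matrix


def write_report(matrix, reports_dir, filename="compatibility-matrix.json"):
    """Write the matrix as JSON; the file name is a placeholder."""
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)
    path = reports_dir / filename
    path.write_text(json.dumps(matrix, indent=2, sort_keys=True))
    return path


if __name__ == "__main__":
    # Illustrative model and capability names only.
    results = [
        ("model-a", "inference", True),
        ("model-a", "tool_calling", False),
        ("model-b", "inference", True),
    ]
    print(json.dumps(build_matrix(results), indent=2, sort_keys=True))
```

Writing with `sort_keys=True` keeps the output stable between runs, which makes nightly reports easier to compare.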
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+name: Pre-Release Integration
+
+on:
+  release:
+    types: [created]
+  workflow_dispatch:
+
+concurrency:
+  group: prerelease-integration
+  cancel-in-progress: true
+
+jobs:
+  stack-integration:
+    name: Full Stack Integration
+    runs-on: macos-latest
+    timeout-minutes: 60
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: true
+
+      - name: Install Python 3.13
+        run: uv python install 3.13
+
+      - name: Install project dependencies
+        run: uv sync --dev
+
+      - name: Install openai for harness tests
+        run: uv pip install "openai>=1.0"
+
+      - name: Install vllm-mlx and litellm
+        run: |
+          uv tool install vllm-mlx==0.2.6
+          uv tool install "litellm[proxy]==1.83.0"
+
+      - name: Run stack integration tests
+        run: uv run pytest -m "integration or harness" -v --tb=long
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+
+      - name: Upload test results
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: integration-results
+          path: ~/.mlx-stack-test-cache/reports/

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -34,3 +34,4 @@ htmlcov/
 
 # Ruff
 .ruff_cache/
+REQUIREMENTS.md

Makefile

Lines changed: 17 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: install lint typecheck test check
+.PHONY: install lint typecheck test check test-catalog test-smoke test-integration test-harness
 
 ## Install dev dependencies
 install:
@@ -18,3 +18,19 @@ test:
 
 ## Run all checks (same as CI)
 check: lint typecheck test
+
+## Run catalog validation (requires network, no models)
+test-catalog:
+	uv run pytest -m catalog_validation -v --tb=short
+
+## Run per-model smoke tests (requires macOS + vllm-mlx)
+test-smoke:
+	uv run pytest -m smoke -v --tb=long
+
+## Run full stack integration tests (requires macOS + vllm-mlx + litellm)
+test-integration:
+	uv run pytest -m integration -v --tb=long
+
+## Run harness compatibility tests (requires above + openai)
+test-harness:
+	uv run pytest -m harness -v --tb=long

pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -53,9 +53,12 @@ packages = ["src/mlx_stack"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 pythonpath = ["src"]
-addopts = "-x -q --tb=short -m 'not integration'"
+addopts = "-x -q --tb=short -m 'not integration and not smoke and not catalog_validation and not harness'"
 markers = [
     "integration: real system integration tests (launchctl, etc.)",
+    "smoke: per-model smoke tests requiring model download and inference",
+    "catalog_validation: catalog data integrity checks (requires network, no models)",
+    "harness: external tool compatibility tests (OpenAI client, streaming)",
 ]
 
 [tool.ruff]
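The markers registered above keep the slow tiers out of the default run. The commit message also mentions skip decorators for non-macOS hosts and insufficient memory. One way to compute such a skip condition with only the standard library is sketched below. `total_memory_gb` and `skip_reason` are hypothetical helpers, not the commit's decorators.

```python
import os
import platform


def total_memory_gb():
    """Best-effort physical memory in GiB, or None if it cannot be read."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1024**3


def skip_reason(required_gb, system=None, memory_gb=None):
    """Return why a model test should be skipped, or None to run it.

    `system` and `memory_gb` default to the current host and exist
    mainly so the logic can be exercised without a Mac.
    """
    system = system or platform.system()
    if system != "Darwin":
        return f"requires macOS, running on {system}"
    if memory_gb is None:
        memory_gb = total_memory_gb()
    if memory_gb is not None and memory_gb < required_gb:
        return f"needs {required_gb} GB RAM, found {memory_gb:.0f} GB"
    return None
```

In a pytest suite, the result could feed `pytest.mark.skipif(reason is not None, reason=reason or "")` alongside a marker such as `@pytest.mark.smoke`, so that an unsupported host reports a skip instead of a failure.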

src/mlx_stack/cli/main.py

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,7 @@
 from mlx_stack.cli.profile import profile as profile_command
 from mlx_stack.cli.pull import pull as pull_command
 from mlx_stack.cli.recommend import recommend as recommend_command
+from mlx_stack.cli.setup import setup as setup_command
 from mlx_stack.cli.status import status as status_command
 from mlx_stack.cli.up import up as up_command
 from mlx_stack.cli.watch import watch as watch_command
@@ -73,6 +74,7 @@ def format_help(self, ctx: click.Context, formatter: click.HelpFormatter) -> Non
         }
 
         command_categories = {
+            "setup": "Setup & Configuration",
             "profile": "Setup & Configuration",
             "config": "Setup & Configuration",
             "init": "Setup & Configuration",
@@ -165,6 +167,7 @@ def cli(ctx: click.Context) -> None:
 # These will be replaced by real implementations in subsequent features.
 
 
+cli.add_command(setup_command, "setup")
 cli.add_command(profile_command, "profile")
 cli.add_command(recommend_command, "recommend")
 cli.add_command(init_command, "init")
