test: restructure test suite into four distinct execution tiers by visahak · Pull Request #128 · AgentToolkit/altk-evolve

visahak · 2026-04-02T01:47:43Z

Background:
The existing test infrastructure suffered from "unknown mark" Pytest warnings, CLI timeout hangs during heavy testing, and a lack of explicit categorization between core logic vs. local infrastructure operations.
What Changed:

Tiered Architecture: Reorganized the test suite into exactly four non-overlapping tiers natively registered in pyproject.toml (no more warnings):
- unit (246 tests) — Core logic + offline Phoenix testing.
- platform_integrations (18 tests) — Fast filesystem local installation tests.
- llm (62 tests) — Active remote LLM inference pipelines.
- e2e (12 tests) — Heavy infrastructure testing simulating full workflows.
Infrastructure Stability: Added a robust phoenix_server fixture in the E2E suite to autonomously spin up and tear down a background observability server during tests.
Better Error Logging: Solved the evolve sync CLI test "phantom hangs" by capturing and aggregating subprocess logs directly, ensuring error stack traces are printed correctly.
Documentation: Updated README.md and docs/LOW_CODE_TRACING.md to reflect the new architecture, removing the obsolete --run-phoenix execution commands.

Summary by CodeRabbit

Documentation
- Updated test running instructions to a 4-tier system (unit, platform integration, end-to-end, LLM evaluation) with explicit pytest markers/flags.
- Updated tracing docs to use the current CLI entrypoint and marker/flag-based E2E invocation.
Chores
- Added pgvector and psycopg dependencies to dev tooling.
- Revised test configuration so different test subsets are selected by markers rather than the prior env-var.

Added psycopg[binary]>=3.1 and pgvector>=0.3 to dev dependency group to ensure all unit tests can run during development and CI. This fixes the test collection error for test_postgres_backend.py while keeping postgres support optional for end users (via the pgvector optional dependency group).

coderabbitai · 2026-04-02T01:47:59Z

Warning

Rate limit exceeded

@visahak has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minutes and 2 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 1 minutes and 2 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f11fbaf5-84d5-40e0-a208-7f1de82a7ad0

📥 Commits

Reviewing files that changed from the base of the PR and between 52f1ed3 and 079fbbd.

📒 Files selected for processing (4)

README.md
README_phoenix_sync.md
docs/LOW_CODE_TRACING.md
tests/e2e/test_e2e_pipeline.py

📝 Walkthrough

Walkthrough

Replaces the old phoenix marker/flag with a multi-tier pytest marker system (adding platform_integrations and e2e), updates CLI invocation paths in docs, and enhances the E2E pipeline with an autostarting Phoenix subprocess, stricter subprocess timeouts, and improved output/polling logic.

Changes

Cohort / File(s)	Summary
Docs `README.md`, `docs/LOW_CODE_TRACING.md`	Rewrote test-run docs to describe 4-tier test organization and pytest markers (`unit`, `platform_integrations`, `e2e`, `llm`); switched CLI examples from `python -m evolve.frontend.cli.cli` to `python -m evolve.cli` / `evolve ...`.
Config & Test Harness `pyproject.toml`, `tests/conftest.py`	Added dev deps (`pgvector>=0.3`, `psycopg[binary]>=3.1`); changed pytest default markexpr from `not phoenix and not llm` → `not llm and not e2e`; removed `phoenix` marker, added `platform_integrations`; replaced `--run-phoenix` flag with `--run-e2e` and updated markexpr override logic.
E2E Tests `tests/e2e/test_e2e_pipeline.py`	Added session-scoped `phoenix_server` fixture that probes localhost:6006 and conditionally launches a Phoenix subprocess; improved sync output capture and polling loop; added timeouts (90s agent run, 30s trace check); switched sync invocation to `evolve sync phoenix ...`; extended process termination wait to 10s.
Unit Test Marker Migration `tests/unit/test_cli.py`, `tests/unit/test_extract_trajectories.py`, `tests/unit/test_mcp_server.py`, `tests/unit/test_phoenix_sync.py`, `tests/unit/test_tracing.py`	Reassigned tests previously marked `phoenix` to `unit` (module-level `pytestmark` or decorator adjustments) to align with new marker taxonomy.
Unit Fixture Adjustment `tests/unit/test_client.py`	Replaced bare `EvolveClient()` construction in shared fixture with `EvolveConfig(backend="filesystem")` passed into `EvolveClient(config=...)`.

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Test Runner
  participant Conftest as pytest conftest
  participant Phoenix as Phoenix Server (local)
  participant CLI as evolve CLI / subprocess
  participant Agent as Agent subprocess
  participant Verifier as Trace verifier subprocess
  Note over Runner,Conftest: Test session start
  Runner->>Conftest: parse flags (--run-e2e?)
  Conftest->>Runner: adjust markexpr (remove "not e2e")
  Runner->>Phoenix: request phoenix_server fixture
  Phoenix->>Phoenix: probe http://localhost:6006/status
  alt status reachable
    Phoenix-->>Runner: yield base URL
  else not reachable
    Phoenix->>CLI: run "evolve sync phoenix ..." (subprocess)
    CLI->>Phoenix: emit stdout lines (polled)
    Phoenix->>CLI: poll until ready or timeout
    CLI-->>Phoenix: yield base URL
  end
  Runner->>Agent: run agent (timeout 90s)
  Agent-->>Runner: produce output
  Runner->>Verifier: run trace verification (timeout 30s)
  Verifier-->>Runner: success/fail
  Runner->>Phoenix: teardown (wait 10s, kill if needed)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Rename Kaizen to Evolve #108 — Overlapping changes to CLI entrypoints and test invocation references (similar migration of module paths and test config).

Suggested reviewers

vinodmut

Poem

🐰 Hopping through markers, tidy and new,
I start Phoenix if it hides from view,
Timeouts watch tails, output I store,
Syncs and traces hum forevermore —
Tests leap onward, neat and true! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main change: reorganizing the test suite into four distinct execution tiers (unit, platform_integrations, llm, and e2e).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

docs/LOW_CODE_TRACING.md (1)

249-250: Add -v to the documented pytest commands for consistency with repo test-run conventions.

This is a docs-quality consistency improvement for test output clarity.

Proposed docs tweak

-uv run pytest -m e2e --run-e2e -s
+uv run pytest -v -m e2e --run-e2e -s

-uv run pytest tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s
+uv run pytest -v tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s

-uv run pytest tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s
+uv run pytest -v tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s

Based on learnings: Run pytest verbosely with the `-v` flag by default when testing.

Also applies to: 258-262

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/LOW_CODE_TRACING.md` around lines 249 - 250, Update the documented
pytest commands (e.g., the occurrences of "uv run pytest -m e2e --run-e2e -s"
and other pytest command examples in the same section) to include the -v flag so
they read "uv run pytest -v -m e2e --run-e2e -s" (and similarly add -v to the
other pytest command examples mentioned around the block) to match repository
test-run conventions and produce verbose test output.

tests/e2e/test_e2e_pipeline.py (3)

21-22: Move imports to the top of the file.

urllib.request and urllib.error are imported after the configuration block. Per Python style conventions (and Ruff enforcement), all imports should be grouped at the top of the file.

♻️ Suggested fix

Move these imports to join the other imports at the top:

 import subprocess
 import time
 import re
 import os
 import datetime
+import urllib.request
+import urllib.error
 import pytest
 from evolve.config.phoenix import phoenix_settings

Then remove lines 21-22.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 21 - 22, Move the late imports
urllib.request and urllib.error to the module import block at the top of
tests/e2e/test_e2e_pipeline.py so they join the other imports; remove the
duplicate import statements from their current position (the lines currently
importing urllib.request and urllib.error) and run the linter to ensure import
grouping/order is correct (references: the urllib.request and urllib.error
imports).

198-200: text=True and universal_newlines=True are redundant.

These parameters are aliases for the same behavior. Having both is harmless but unnecessary.

🧹 Suggested fix

     process = subprocess.Popen(
         sync_command,
         stdout=subprocess.PIPE,
         stderr=subprocess.STDOUT,  # Merge stderr to monitor everything
         text=True,
         bufsize=1,
-        universal_newlines=True,
     )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 198 - 200, The call sets both
text=True and universal_newlines=True which are redundant aliases; remove
universal_newlines=True (keep text=True for clarity) in the subprocess
invocation (the call that also sets bufsize=1) so only one of these flags
remains.

11-13: Remove unused TIMESTAMP variable.

TIMESTAMP is defined here but never used—the test generates current_timestamp locally at line 102. This is dead code that should be removed.

🧹 Suggested fix

 # Configuration
 PHOENIX_URL = phoenix_settings.url
-# Use a session-scope timestamp or generate per test?
-# Per-test ensures no collisions even if run in parallel (though these should satisfy sequential)
-TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 11 - 13, Remove the unused
top-level TIMESTAMP variable: delete the TIMESTAMP =
datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the module no
longer contains dead code; tests already generate current_timestamp locally (see
current_timestamp usage in the test function), so no other changes are required.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 203-205: Replace the documented CLI invocation "python -m
evolve.cli" with the entry-point command "evolve" wherever it appears (e.g., the
example command lines showing "uv run python -m evolve.cli sync phoenix ...");
the project exposes an "evolve" script in pyproject.toml and there is no
evolve.cli __main__.py, so update those occurrences (also the other instance
around lines 212–213) to use "uv run evolve" instead.

In `@README.md`:
- Around line 121-131: Update the "Unit Tests (Default)" section to accurately
reflect pytest defaults: either state that `uv run pytest` runs the default
selection (`-m "not llm and not e2e"` and therefore includes
`platform_integrations`), or add/replace the unit-only example with `uv run
pytest -m unit`; edit the heading/content mentioning "Unit Tests (Default)" and
the example command strings so readers know the difference between the default
`uv run pytest` behavior and the explicit unit-only `uv run pytest -m unit`.

In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 24-31: The phoenix_server fixture currently yields
"http://localhost:6006" but tests read PHOENIX_URL from phoenix_settings.url,
causing a possible mismatch; update the fixture phoenix_server so tests actually
use its URL by either removing autouse=True and having tests accept
phoenix_server as a parameter, or keep autouse and set phoenix_settings.url (or
PHOENIX_URL) to the yielded URL inside phoenix_server before yield, or update
tests to accept phoenix_server and use that returned URL instead of
phoenix_settings.url; locate the phoenix_server fixture and phoenix_settings.url
/ PHOENIX_URL references to implement one of these fixes consistently.

---

Nitpick comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 249-250: Update the documented pytest commands (e.g., the
occurrences of "uv run pytest -m e2e --run-e2e -s" and other pytest command
examples in the same section) to include the -v flag so they read "uv run pytest
-v -m e2e --run-e2e -s" (and similarly add -v to the other pytest command
examples mentioned around the block) to match repository test-run conventions
and produce verbose test output.

In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 21-22: Move the late imports urllib.request and urllib.error to
the module import block at the top of tests/e2e/test_e2e_pipeline.py so they
join the other imports; remove the duplicate import statements from their
current position (the lines currently importing urllib.request and urllib.error)
and run the linter to ensure import grouping/order is correct (references: the
urllib.request and urllib.error imports).
- Around line 198-200: The call sets both text=True and universal_newlines=True
which are redundant aliases; remove universal_newlines=True (keep text=True for
clarity) in the subprocess invocation (the call that also sets bufsize=1) so
only one of these flags remains.
- Around line 11-13: Remove the unused top-level TIMESTAMP variable: delete the
TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the
module no longer contains dead code; tests already generate current_timestamp
locally (see current_timestamp usage in the test function), so no other changes
are required.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d353ec9-61a8-4fba-933d-d0573b8caf62

📥 Commits

Reviewing files that changed from the base of the PR and between 824e4d9 and 89b7d63.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

README.md
docs/LOW_CODE_TRACING.md
pyproject.toml
tests/conftest.py
tests/e2e/test_e2e_pipeline.py
tests/unit/test_cli.py
tests/unit/test_extract_trajectories.py
tests/unit/test_mcp_server.py
tests/unit/test_phoenix_sync.py
tests/unit/test_tracing.py

💤 Files with no reviewable changes (1)

tests/unit/test_cli.py

docs/LOW_CODE_TRACING.md

README.md

tests/e2e/test_e2e_pipeline.py

…ig in unit test fixture

…admes

visahak added 2 commits April 1, 2026 16:41

test: restructure test suite into four distinct execution tiers

89b7d63

coderabbitai bot reviewed Apr 2, 2026

View reviewed changes

docs/LOW_CODE_TRACING.md Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

tests/e2e/test_e2e_pipeline.py Show resolved Hide resolved

visahak added 3 commits April 1, 2026 22:11

fix(tests): resolve linting errors and force isolated filesystem conf…

52f1ed3

…ig in unit test fixture

docs: standardize cli commands to use evolve entrypoint across all re…

187ef5d

…admes

test: dynamically inject E2E phoenix fixture URL

079fbbd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: restructure test suite into four distinct execution tiers#128

test: restructure test suite into four distinct execution tiers#128
visahak wants to merge 5 commits intoAgentToolkit:mainfrom
visahak:test/restructure-test-suite

visahak commented Apr 2, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 2, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

visahak commented Apr 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

visahak commented Apr 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 2, 2026 •

edited

Loading