Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
File renamed without changes.
3 changes: 3 additions & 0 deletions .claude/sessions/__live_session__/.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
task_execution_id: exec-session-20260209-211636
timestamp: 2026-02-09T21:16:36-05:00
pid: orchestrator
156 changes: 156 additions & 0 deletions .claude/sessions/__live_session__/execution_context.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
# Execution Context

## Project Patterns
- Output/data types use `@dataclass` (not Pydantic BaseModel). Examples: `AgentResult`, `ContextState`, `CompactionResult`, `UsageRecord`, `TokenUsage`, `CostBreakdown`.
- Configuration types use `pydantic.BaseModel` (e.g., `AgentConfig`, `CompactionConfig`).
- All modules use `from __future__ import annotations` at the top.
- Google-style docstrings on all public classes and methods.
- Dataclass fields with mutable defaults use `field(default_factory=...)`.
- Test files use class-based test organization (`class TestX:`) with `-> None` return annotations.
- Tests import directly from the module being tested, not from top-level `__init__.py`.
- Line length is 100 chars (ruff config).
- Error hierarchies follow prompts/errors.py pattern: standalone `Exception` base class, contextual attributes, Google-style docstrings, `__reduce__` for pickle support.
- Pydantic `field_validator` does NOT run on default values — use `model_validator(mode="after")` for post-init processing of defaults.
- `Callable` import: use `from collections.abc import Callable` (ruff UP035 rule).
- `asyncio_mode = "auto"` in pyproject.toml — no `@pytest.mark.asyncio` decorators needed.

## Key Decisions
- Prior session used dataclasses for `MessageStats`, `ToolCallInfo`, `Turn` -- consistent with project convention for read-only output types.
- `SkillInfo` is a `@dataclass` (output type), `Skill`/`SkillConfig` are Pydantic `BaseModel` (config types).
- `SubagentConfig` is Pydantic `BaseModel`, `SubagentResult`/`DelegationHandle` are `@dataclass`.
- Enums use `(str, Enum)` pattern (not `StrEnum`) for config serialization compatibility.
- `_is_subagent` uses Pydantic `PrivateAttr` — not settable via constructor kwargs, not in `model_dump()`.
- Loader modules follow: `_read_file()` → `_split_*()` → `_parse_*()` → `_validate_*()` → `_map_*()` pattern.
- Validation uses non-throwing pattern — all issues returned as structured `ValidationResult`.
- `parent_agent.override(model=...)` does NOT affect spawned subagents — tests must mock `_spawn` to return pre-configured test agents.
- `UsageTracker.get_total_usage()` returns a mutable `TokenUsage` object — tests must capture scalar values before mutations.
- `_UsageTrackingHandle` extends `DelegationHandle` (dataclass) — must call `super().__init__()` with all required fields.
- Agent constructor: when `config` is passed, top-level `system_prompt` kwarg is ignored (config wins).
- Explicit init pattern: `init_skills()`/`init_subagents()` replace lazy property init. Properties raise `AttributeError` if not initialized. Double-init is idempotent (no-op). Subagents blocked from init.
- `invoke_skill()` is `async def`. Use `invoke_skill_sync()` for sync contexts. `invoke_skill_sync()` short-circuits non-fork skills via `SkillManager.activate()` directly.
- `activate_with_fork()` is `async def` — uses `await subagent_manager.delegate()` directly. No more ThreadPoolExecutor/asyncio.run bridging.
- TYPE_CHECKING imports used for skill types in core.py to avoid circular imports; runtime imports are lazy inside methods.
- Coverage command needs `--cov=src/mamba_agents/skills` (path-based) not `--cov=mamba_agents.skills` (module-based).
- MCP __init__.py pattern is reference for package exports (alphabetically sorted __all__, grouped imports).
- `SkillManager` no longer has a `subagent_manager` property. Fork-mode activation goes through `integration.activate_with_fork()`.
- `SubagentManager` accepts `skill_registry: SkillRegistry | None` (not `skill_manager: SkillManager`).
- The Agent facade mediates between both managers for fork-mode skills in `invoke_skill()`.
- `activate_with_fork()` now accepts `get_skill_fn: Callable[[str], Skill | None]` callback for circular detection.
- `activate_with_fork()` uses lazy imports for all subagent types to prevent circular deps.
- `patch("mamba_agents.skills.integration.activate_with_fork")` needed (not `manager.activate_with_fork`) because import is lazy inside method.
- invoke_skill pydantic-ai tool registered at end of `init_skills()` only when `len(self._skill_manager) > 0`.
- Tool description built at registration time listing currently available non-disabled skills. Invocation queries live registry at call time.
- pydantic-ai `_agent._function_toolset.tools` dict keyed by name — pop existing tool before re-registering to avoid `UserError`.
- `register_skill()` calls `_register_invoke_skill_tool()` to refresh tool description after adding skills.
- `deregister_skill()` removes invoke_skill tool entirely when no skills remain; `register_skill()` re-creates it when skills added to previously empty manager.
- Agent now has `deregister_skill(name)` method delegating to `skill_manager.deregister()`.
- `UsageTracker.record_subagent_usage(name, usage)` is the public API for subagent usage tracking (replaces direct `_subagent_totals` mutation).
- ReAct `final_answer` tool registration moved from `__init__()` to `run()` — tool only exists during workflow execution, cleaned up in `finally` block.
- pydantic-ai tools stored in `agent._function_toolset.tools` (dict keyed by tool name). Save/restore via `dict()` copy + `clear()`/`update()`.

## Known Issues
- No runtime type validation on dataclass fields (by design -- project uses dataclasses for output types, not input validation).
- skills/errors.py task #1 added `__reduce__` for pickle support; subagents/errors.py task #3 also has `__reduce__`. Ensure consistency across both modules.
- `ReActWorkflow` no longer registers `final_answer` in `__init__` — code checking for tool presence after construction (without running) will not find it.
- pydantic-ai raises `UserError` when registering a tool with a name that already exists — must pop existing tool before re-registering.

## File Map
- `src/mamba_agents/agent/messages.py` - MessageQuery class, data models, filter/query, analytics, export
- `src/mamba_agents/agent/__init__.py` - Agent module exports
- `src/mamba_agents/__init__.py` - Top-level package exports
- `src/mamba_agents/agent/core.py` - Main Agent class. Now has `init_skills()`, `init_subagents()`, `has_skill_manager`, `has_subagent_manager`, `async invoke_skill()`, `invoke_skill_sync()`
- `src/mamba_agents/agent/display/` - Display module (Rich, Plain, HTML renderers)
- `pyproject.toml` - Package metadata, dependencies, tool configuration
- `src/mamba_agents/skills/__init__.py` - Skills package init
- `src/mamba_agents/skills/errors.py` - Skill error hierarchy (SkillError base + 5 subclasses)
- `src/mamba_agents/skills/config.py` - Skill enums (SkillScope, TrustLevel), data models (SkillInfo, Skill, SkillConfig, ValidationResult)
- `src/mamba_agents/subagents/__init__.py` - Subagents package init
- `src/mamba_agents/subagents/errors.py` - SubagentError hierarchy (6 classes)
- `src/mamba_agents/subagents/config.py` - SubagentConfig (BaseModel), SubagentResult (dataclass), DelegationHandle (dataclass)
- `src/mamba_agents/skills/loader.py` - SKILL.md parser (load_metadata, load_full)
- `src/mamba_agents/skills/registry.py` - SkillRegistry class (register/deregister/get/list/has)
- `src/mamba_agents/skills/validator.py` - Skill validation (validate, validate_frontmatter, resolve_trust_level, check_trust_restrictions)
- `src/mamba_agents/subagents/loader.py` - Subagent markdown config loader (load_subagent_config, discover_subagents)
- `src/mamba_agents/agent/config.py` - AgentConfig now has `_is_subagent` PrivateAttr
- `src/mamba_agents/config/settings.py` - AgentSettings now has `skills: SkillConfig | None`
- `src/mamba_agents/tokens/tracker.py` - UsageTracker now has `source` field, `get_subagent_usage()`, `record_subagent_usage()`
- `src/mamba_agents/skills/discovery.py` - scan_directory, discover_skills (three-level priority)
- `src/mamba_agents/skills/invocation.py` - InvocationSource enum, parse_arguments, substitute_arguments, activate, deactivate
- `src/mamba_agents/skills/manager.py` - SkillManager facade (discover, register/deregister, get/list, activate/deactivate, validate, get_tools)
- `src/mamba_agents/skills/testing.py` - SkillTestHarness class + skill_harness pytest fixture
- `src/mamba_agents/subagents/spawner.py` - spawn, _build_system_prompt, _resolve_tools, _enforce_no_nesting
- `src/mamba_agents/subagents/delegation.py` - delegate (async), delegate_sync, delegate_async
- `src/mamba_agents/subagents/manager.py` - SubagentManager facade (register/deregister/list/get, delegate/delegate_sync/delegate_async, spawn_dynamic, get_usage_breakdown)
- `src/mamba_agents/skills/integration.py` - Skills-subagents integration (fork activation, circular detection)
- `tests/conftest.py` - Shared fixtures including sample_skill_dir, sample_skill_info, sample_skill, sample_subagent_config, sample_agent_dir

- `src/mamba_agents/workflows/react/workflow.py` - Now has `_save_tool_state()`, `_restore_tool_state()`, `_register_final_answer_tool()` methods and overridden `run()` with try/finally cleanup
- `tests/unit/test_agent_invoke_skill_tool.py` - Tests for invoke_skill pydantic-ai tool (registration, description, permissions, errors, edge cases, integration, parameters, dynamic description)
- `tests/unit/test_regression_fragility_points.py` - 6 regression tests for resolved fragility points (skills in agent.run, async fork, no lazy init, tracker API, no circular init, ReAct cleanup)

## Task History
### Prior Sessions Summary
Previous execution sessions implemented: (1) message querying/analytics (MessageQuery, MessageStats, ToolCallInfo, Turn), display rendering (Rich/Plain/HTML renderers, DisplayPreset, standalone functions, snapshot tests), __init__.py exports; (2) skill error hierarchy (35 tests), skill data models/enums (30 tests), subagent error hierarchy (42 tests), subagent data models (22 tests), SKILL.md loader/parser (51 tests), skill registry (46 tests), skill validation/trust levels (48 tests), subagent config loader (30 tests), AgentConfig/AgentSettings extensions (17 tests), UsageTracker subagent aggregation (15 tests), skill discovery (37 tests), skill invocation (79 tests), subagent spawner (30 tests), SkillManager facade (55 tests), subagent delegation (49 tests), skills package __init__.py (18 tests), skills cross-cutting integration (36 tests), SkillTestHarness (39 tests), Agent skills facade (31 tests), SubagentManager facade (52 tests), subagents __init__.py (16 tests), subagent unit tests phase 2 (46 tests), Agent subagents facade (31 tests), Skills-Subagents bi-directional integration (38 tests). All tasks PASS. Full suite at 2322+ tests.

### Task [17]: Add record_subagent_usage() public method to UsageTracker - PASS
- Added `record_subagent_usage(name, usage)` to UsageTracker, refactored SubagentManager to use it
- `_UsageTrackingHandle` delegates to `_aggregate_usage()` which was the single refactor point
- 8 new tests, 23/23 passing in test_tracker.py

### Task [18]: Remove bidirectional dependency between SkillManager and SubagentManager - PASS
- Removed `subagent_manager` property/setter from SkillManager, changed SubagentManager to accept `skill_registry`
- Agent facade now mediates fork-mode skills via `integration.activate_with_fork()`
- Many integration tests rewritten for new mediator pattern. 2409 total tests passing.

### Task [19]: Add ReAct tool state save/restore around workflow run - PASS
- Moved `final_answer` registration from `__init__()` to `run()` with try/finally cleanup
- Added `_save_tool_state()`, `_restore_tool_state()`, `_register_final_answer_tool()` methods
- 10 new tests, 34/34 passing in test_react_workflow.py

### Task [20]: Replace lazy property initialization with explicit init_skills() and init_subagents() - PASS
- Replaced lazy property pattern with explicit `init_skills()`/`init_subagents()` methods
- Properties raise `AttributeError` when not initialized, with helpful guidance
- Double-init is idempotent, subagents blocked from init. 2424 tests passing.

### Task [22]: Redesign activate_with_fork() as fully async - PASS
- `activate_with_fork()` changed to `async def`, removed ThreadPoolExecutor/asyncio.run
- `invoke_skill()` now `async def`, added `invoke_skill_sync()` for sync contexts
- When mocking async methods, use `new_callable=AsyncMock`. 2424 tests passing.

### Task [27]: Update ReAct workflow tests for tool cleanup behavior - PASS
- No changes needed. Task #19 already handled all test updates. 132/132 workflow tests pass.

### Task [21]: Update Agent facade to wire managers through integration module - PASS
- No changes needed. Prior tasks #18, #20, #22 already wired Agent as composition root. Verified all criteria.

### Task [23]: Make Agent.invoke_skill() async and add invoke_skill_sync() wrapper - PASS
- No changes needed. Task #22 already implemented. Verified `invoke_skill()` async and `invoke_skill_sync()` sync wrapper exist.

### Task [24]: Register invoke_skill pydantic-ai tool when skills are initialized - PASS
- Files modified: `src/mamba_agents/agent/core.py` (added `_register_invoke_skill_tool` method, called from `init_skills()`), `tests/unit/test_agent_invoke_skill_tool.py` (new file, 24 tests)
- pydantic-ai `Tool` objects have `.function_schema.json_schema` for parameter inspection, `.description` for tool description, `.function` for the underlying callable
- `tool_plain()` registers async functions; pydantic-ai handles async tool execution
- `_agent._function_toolset.tools` dict keyed by name; pop before re-registering to avoid `UserError`
- Closures over `self` (as `agent_self`) work well for tool functions that need agent state
- `TestModel(call_tools='all')` automatically calls all registered tools during `run()` — useful for integration tests
- 2448 total tests passing (24 new)

### Task [26]: Update existing skill and subagent tests for new API signatures - PASS
- No changes needed. All tests already updated by prior tasks #17, #18, #20, #22, #23. 908 skill/subagent tests pass across 20+ test files.

### Task [25]: Update invoke_skill tool description dynamically when skills change - PASS
- Updated `register_skill()` to call `_register_invoke_skill_tool()` after registration. Added `deregister_skill()` method that removes tool when no skills remain.
- `_register_invoke_skill_tool()` already designed for idempotent re-registration — straightforward follow-on.
- 7 new tests, 2455 total passing.

### Task [28]: Write 6 regression tests for resolved fragility points - PASS
- New file: `tests/unit/test_regression_fragility_points.py` — 6 regression test classes, 13 test methods
- Covers: skills invocable during agent.run(), fork mode in async context, no lazy init side effects, UsageTracker public API, no circular initialization, ReAct tool cleanup
- Note: SubagentManager._aggregate_usage() still accesses _subagent_totals directly internally; the public API record_subagent_usage() exists but the manager uses its own private path. Regression test validates the public API works.
- 2468 total tests passing.

### Task [29]: Verify coverage targets and update CLAUDE.md fragility points - PASS
- Files modified: `tests/unit/test_workflows/test_react_workflow.py` (26 new tests), `CLAUDE.md` (fragility points + implementation notes)
- All 8 target modules >= 80%. Only `workflows/react/workflow.py` needed work (58% → 99%).
- Removed 5 resolved fragility points, updated 1 partially resolved, kept 3 unresolved.
- 2494 total tests passing.
30 changes: 30 additions & 0 deletions .claude/sessions/__live_session__/execution_plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Execution Plan

Tasks to execute: 13
Retry limit: 3 per task
Max parallel: 5 per wave

## WAVE 1 (3 tasks)
1. [#18] Remove bidirectional dependency between SkillManager and SubagentManager
2. [#17] Add record_subagent_usage() public method to UsageTracker
3. [#19] Add ReAct tool state save/restore around workflow run

## WAVE 2 (3 tasks)
4. [#20] Replace lazy property initialization with explicit init_skills() and init_subagents() — after [#18]
5. [#22] Redesign activate_with_fork() as fully async — after [#18]
6. [#27] Update ReAct workflow tests for tool cleanup behavior — after [#19]

## WAVE 3 (2 tasks)
7. [#21] Update Agent facade to wire managers through integration module — after [#18, #20]
8. [#23] Make Agent.invoke_skill() async and add invoke_skill_sync() wrapper — after [#22]

## WAVE 4 (2 tasks)
9. [#24] Register invoke_skill pydantic-ai tool when skills are initialized — after [#20, #22, #23]
10. [#26] Update existing skill and subagent tests for new API signatures — after [#17, #18, #20, #22, #23]

## WAVE 5 (2 tasks)
11. [#25] Update invoke_skill tool description dynamically when skills change — after [#24]
12. [#28] Write 6 regression tests for resolved fragility points — after [#17, #18, #19, #20, #22, #24]

## WAVE 6 (1 task)
13. [#29] Verify coverage targets and update CLAUDE.md fragility points — after [#26, #27, #28]
20 changes: 20 additions & 0 deletions .claude/sessions/__live_session__/progress.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Execution Progress
Status: Complete
Wave: 6 of 6 (all complete)
Max Parallel: 5
Updated: 2026-02-09T22:40:00-05:00

## Completed This Session (13/13 PASS)
- [#17] Add record_subagent_usage() public method to UsageTracker — PASS (3m 8s)
- [#19] Add ReAct tool state save/restore around workflow run — PASS (6m 41s)
- [#18] Remove bidirectional dependency between SkillManager and SubagentManager — PASS (12m 48s)
- [#27] Update ReAct workflow tests for tool cleanup behavior — PASS (1m 57s)
- [#20] Replace lazy property initialization with explicit init_skills() and init_subagents() — PASS (10m 38s)
- [#22] Redesign activate_with_fork() as fully async — PASS (11m 30s)
- [#23] Make Agent.invoke_skill() async and add invoke_skill_sync() wrapper — PASS (1m 31s)
- [#21] Update Agent facade to wire managers through integration module — PASS (1m 52s)
- [#24] Register invoke_skill pydantic-ai tool when skills are initialized — PASS (5m 59s)
- [#26] Update existing skill and subagent tests for new API signatures — PASS (4m 47s)
- [#25] Update invoke_skill tool description dynamically when skills change — PASS (1m 46s)
- [#28] Write 6 regression tests for resolved fragility points — PASS (2m 16s)
- [#29] Verify coverage targets and update CLAUDE.md fragility points — PASS (6m 32s)
17 changes: 17 additions & 0 deletions .claude/sessions/__live_session__/task_log.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Task Execution Log

| Task ID | Subject | Status | Attempts | Duration | Token Usage |
|---------|---------|--------|----------|----------|-------------|
| 17 | Add record_subagent_usage() public method to UsageTracker | PASS | 1/3 | 3m 8s | 91,075 |
| 19 | Add ReAct tool state save/restore around workflow run | PASS | 1/3 | 6m 41s | 102,624 |
| 18 | Remove bidirectional dependency between SkillManager and SubagentManager | PASS | 1/3 | 12m 48s | 71,161 |
| 27 | Update ReAct workflow tests for tool cleanup behavior | PASS | 1/3 | 1m 57s | 68,528 |
| 20 | Replace lazy property initialization with explicit init_skills() and init_subagents() | PASS | 1/3 | 10m 38s | 159,417 |
| 22 | Redesign activate_with_fork() as fully async | PASS | 1/3 | 11m 30s | 144,383 |
| 23 | Make Agent.invoke_skill() async and add invoke_skill_sync() wrapper | PASS | 1/3 | 1m 31s | 69,622 |
| 21 | Update Agent facade to wire managers through integration module | PASS | 1/3 | 1m 52s | 94,212 |
| 26 | Update existing skill and subagent tests for new API signatures | PASS | 1/3 | 4m 47s | 50,055 |
| 24 | Register invoke_skill pydantic-ai tool when skills are initialized | PASS | 1/3 | 5m 59s | 117,933 |
| 25 | Update invoke_skill tool description dynamically when skills change | PASS | 1/3 | 1m 46s | 82,754 |
| 28 | Write 6 regression tests for resolved fragility points | PASS | 1/3 | 2m 16s | 102,626 |
| 29 | Verify coverage targets and update CLAUDE.md fragility points | PASS | 1/3 | 6m 32s | 96,398 |
3 changes: 3 additions & 0 deletions .claude/sessions/exec-session-20260205/.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
task_execution_id: exec-session-20260205-224311
timestamp: 2026-02-05T22:43:11Z
pid: orchestrator
Loading
Loading