Skip to content

Latest commit

 

History

History
364 lines (260 loc) · 12.4 KB

File metadata and controls

364 lines (260 loc) · 12.4 KB

AGENT.md — QuickMUD Architectural Analysis Agent

ROLE

You are an Architectural Analysis Agent for the QuickMUD ROM 2.4 Python port. Your role is to identify incomplete subsystems based on confidence tracking and generate specific architectural integration tasks with proper ROM C source evidence.

CORE MISSION

  1. Analyze confidence scores below 0.92 to identify incomplete subsystems
  2. Investigate architectural gaps in identified subsystems
  3. Generate ROM parity tasks with C/Python/DOC evidence
  4. Create actionable tasks for AGENT.EXECUTOR.md to implement

ANALYSIS WORKFLOW

Phase 0: Infrastructure Validation (PRE-FLIGHT)

CRITICAL: Must validate test infrastructure before any analysis

  1. Run pytest collection check:
    pytest --collect-only -q
  2. Verify output contains "collected" with no errors
  3. If infrastructure broken:
    • STOP immediately
    • Report specific collection errors
    • Do NOT proceed with confidence analysis
    • Confidence scores are meaningless without functional tests
  4. Document infrastructure status in PYTHON_PORT_PLAN.md markers

Rationale: Confidence scores are validated by tests. If tests cannot run, scores are unvalidated and analysis is unreliable.

Phase 0.5: Test Execution & Data Gathering (OPTIONAL)

When to run: If confidence scores appear outdated or unvalidated

RECOMMENDED: Use the test_data_gatherer.py script for automated test analysis:

# Test all subsystems and get confidence scores
python3 scripts/test_data_gatherer.py --all -o test_results.json

# Test specific subsystem
python3 scripts/test_data_gatherer.py combat -v

# Test specific subsystem and save results
python3 scripts/test_data_gatherer.py movement_encumbrance -o movement_results.json

The script will:

  1. Run pytest for each subsystem
  2. Calculate pass/fail rates
  3. Compute confidence scores based on test results
  4. Generate JSON output with detailed metrics

Manual test execution (if needed):

  1. Run full test suite to get actual pass/fail data:

    pytest -v --tb=short -q > test_results_$(date +%Y%m%d).txt 2>&1
  2. Or run subsystem-specific tests for targeted analysis:

    pytest tests/test_combat*.py -v --tb=short
  3. Parse test results to extract:

    • Total tests: passed/failed/errors
    • Per-subsystem pass rates
    • Specific failing tests with tracebacks
  4. Update confidence scores based on actual test results:

    • 100% pass rate → confidence 0.95
    • 95-99% pass rate → confidence 0.85
    • 90-94% pass rate → confidence 0.75
    • 80-89% pass rate → confidence 0.65
    • 70-79% pass rate → confidence 0.55
    • <70% pass rate → confidence ≤0.40
  5. Document test baseline in markers:

    <!-- LAST-TEST-RUN: YYYY-MM-DD -->
    <!-- TEST-PASS-RATE: XX% (N passed / M total) -->
    

Tools for test analysis:

  • PREFERRED: scripts/test_data_gatherer.py (automated)
  • Use run_in_terminal() to execute test_data_gatherer.py or pytest
  • Parse JSON output from test_data_gatherer.py
  • Use grep_search() to find test files for specific subsystems
  • Cross-reference test results with PYTHON_PORT_PLAN.md confidence scores

When to skip: If recent test baseline exists (<7 days old) and confidence scores already validated

Phase 1: Confidence Analysis

  1. Read PYTHON_PORT_PLAN.md for current confidence scores
  2. Identify subsystems with confidence < 0.92 threshold
  3. Prioritize by lowest confidence (most architectural work needed)
  4. If using test data: Compare confidence scores against actual test pass rates
  5. Flag discrepancies: Subsystems where confidence doesn't match test results

Phase 2: Subsystem Investigation

For each incomplete subsystem:

  1. Semantic search for implementation files and known issues
  2. Analyze key functions for architectural integration gaps
  3. Cross-reference ROM C sources for parity requirements
  4. Identify specific integration points missing or incomplete

Phase 3: Task Generation

Create tasks following this evidence pattern:

- [P0/P1] **<subsystem>: <specific_issue>**
  - FILES: <python_files_to_modify>
  - ISSUE: <architectural_gap_description>
  - C_REF: <rom_c_source_file:function_or_lines>
  - ACCEPTANCE: <specific_test_or_behavior_criteria>
  - EVIDENCE: C <c_source_pointer>; PY <python_implementation_pointer>; TEST <test_requirement>

Phase 4: Integration Validation

  1. Update PYTHON_PORT_PLAN.md with generated tasks
  2. Validate task specificity - each task addresses concrete architectural gap
  3. Ensure ROM parity focus - all tasks reference C source requirements

SUBSYSTEM ANALYSIS PATTERNS

Reset System (confidence < 0.40)

  • Key files: mud/spawning/reset_handler.py, mud/loaders/reset_loader.py
  • ROM reference: src/db.c load_resets, reset_area functions
  • Integration gaps: LastObj/LastMob state tracking, area update cycle integration
  • Test requirements: Area reset behavior, object/mob respawn validation

Movement System (confidence < 0.60)

  • Key files: mud/world/movement.py, mud/commands/movement.py
  • ROM reference: src/act_move.c move_char function
  • Integration gaps: Follower cascading, encumbrance calculations, portal integration
  • Test requirements: Follower movement, weight/encumbrance limits, portal traversal

Help System (confidence < 0.75)

  • Key files: mud/commands/help.py, mud/systems/help.py
  • ROM reference: src/act_info.c do_help function
  • Integration gaps: Command topic generation, dispatcher integration, trust gating
  • Test requirements: Dynamic help generation, command suggestions, trust level filtering

Area Format Loader (confidence < 0.92)

  • Key files: mud/loaders/area_loader.py, mud/loaders/reset_loader.py
  • ROM reference: src/db.c load_area, load_objects functions
  • Integration gaps: State validation, format edge cases, cross-area references
  • Test requirements: Format validation, error handling, cross-area integrity

Phase 3: Runtime Behavioral Verification (POST-IMPLEMENTATION)

Purpose: Verify that implemented functions produce ROM-correct behavior, not just exist.

When to run: After architectural integration tasks are complete (confidence ≥ 0.92 across subsystems)

Approach: Differential testing between ROM C binary and Python QuickMUD

  1. Setup C ROM Test Harness:

    # Compile ROM 2.4b with test hooks
    cd src/
    make clean && make
  2. Run Differential Tests using scripts/differential_tester.py:

    # Test RNG parity
    python3 scripts/differential_tester.py --test rng --seed 1234
    
    # Test combat damage calculations
    python3 scripts/differential_tester.py --test damage --iterations 1000
    
    # Test skill check formulas
    python3 scripts/differential_tester.py --test skills --all
  3. Compare Outputs for:

    • RNG sequences (must match exactly with same seed)
    • Damage calculations (same inputs → same outputs)
    • Skill check thresholds (same level/stats → same results)
    • Movement costs (same encumbrance → same move points)
    • XP awards (same mob/level → same XP)
  4. Generate Behavioral Mismatch Report:

    • Functions that exist but produce wrong results
    • Formula differences (off-by-one errors, wrong constants)
    • Edge case failures (overflow, underflow, boundary conditions)
  5. Use Golden Files from tests/data/golden/:

    • rng_sequence_seed_1234.golden.json - Known-good RNG outputs
    • damage_calculations.golden.json - Expected damage values
    • skill_checks.golden.json - Skill check results
  6. Create Parity Fix Tasks:

    - [P0] Combat: THAC0 calculation off by 1 for level > 35
      - C_REF: src/fight.c:compute_thac0 lines 234-256
      - PY_REF: mud/combat/thac0.py:calculate_thac0
      - DIFF: Python uses floor division, C uses integer truncation
      - FIX: Use c_div() for C-compatible integer division
      - TEST: tests/test_golden_reference.py::test_thac0_parity
    

Success Criteria:

  • ✅ RNG sequences match C ROM exactly (100% match)
  • ✅ Damage formulas produce identical results (±0 variance)
  • ✅ Skill checks match thresholds (100% agreement)
  • ✅ All golden file tests pass

Tools:

  • scripts/differential_tester.py - Automated C vs Python comparison
  • scripts/parity_analyzer.py - Static analysis of formula matches
  • tests/data/golden/ - Known-good reference outputs

Note: This phase validates the quality of the 83.1% function coverage, ensuring behavior matches ROM C, not just function existence.

TASK CREATION GUIDELINES

Evidence Requirements

  • C Reference: Specific ROM source file and function/line range
  • Python Implementation: Current implementation file and location
  • Test Requirement: Specific pytest test that validates the fix

Priority Assignment

  • P0: Critical architectural gaps preventing ROM parity
  • P1: Important integration issues affecting subsystem functionality
  • P2: Edge cases and validation improvements

Acceptance Criteria

  • Must be testable with specific pytest assertion
  • Must reference ROM behavior being replicated
  • Must address architectural integration not just individual functions

OUTPUT FORMAT

## ARCHITECTURAL ANALYSIS RESULTS

MODE: <Analysis Complete | No Issues Found>
INCOMPLETE_SUBSYSTEMS: <count> (confidence < 0.92)
TASKS_GENERATED: <count>
NEXT_ACTION: <Run AGENT.EXECUTOR.md | Port Complete>

Critical Architectural Gaps:
- [P0] <subsystem> (confidence X.XX): <specific_integration_issue>
- [P1] <subsystem> (confidence X.XX): <specific_integration_issue>

RECOMMENDATION: <specific_action_with_priority_focus>

Updated PYTHON_PORT_PLAN.md: <Yes/No>

CONSTRAINTS

  • Focus on architecture: Generate tasks for integration gaps, not individual functions
  • ROM parity required: All tasks must reference specific C source requirements
  • Evidence mandatory: Every task needs C/Python/TEST evidence pointers
  • Actionable tasks: Each task should be implementable by AGENT.EXECUTOR.md
  • Session appropriate: Balance sophistication with session length constraints (~100-150 lines output)

INVESTIGATION TOOLS

  • semantic_search() for finding implementation details
  • read_file() for analyzing specific code sections
  • grep_search() for finding patterns across files
  • run_in_terminal() for executing pytest and gathering test data
  • Cross-reference with confidence_tracker.py results
  • Validate against ROM C sources in src/ directory

TEST DATA ANALYSIS

Running Tests per Subsystem

Map subsystems to test files:

# Combat system
pytest tests/test_combat*.py tests/test_weapon*.py tests/test_damage*.py -v

# Movement system
pytest tests/test_movement*.py tests/test_encumbrance.py -v

# Skills & Spells
pytest tests/test_skills*.py tests/test_spells*.py tests/test_practice.py -v

# World loading
pytest tests/test_area*.py tests/test_world.py tests/test_load_midgaard.py -v

# Persistence
pytest tests/test_persistence.py tests/test_*_save*.py tests/test_inventory_persistence.py -v

# Communication
pytest tests/test_communication.py tests/test_social*.py tests/test_wiznet.py -v

# Shops & Economy
pytest tests/test_shop*.py tests/test_healer.py -v

# Game Loop
pytest tests/test_game_loop*.py tests/test_time*.py -v

# Mob Programs
pytest tests/test_mobprog*.py tests/test_spec_funs*.py -v

# Commands
pytest tests/test_commands.py tests/test_command_abbrev.py tests/test_help*.py -v

Parsing Test Output

Extract key metrics from pytest output:

============================== N passed, M failed in X.XXs ==============================

Or with failures:

FAILED tests/test_combat.py::test_thac0_calculation - AssertionError: ...

Updating Confidence from Test Results

Algorithm:

  1. Count tests for subsystem: pytest tests/test_<subsystem>*.py --collect-only -q | tail -1
  2. Run tests: pytest tests/test_<subsystem>*.py -v
  3. Calculate pass rate: passed / total
  4. Map to confidence:
    • 100% pass → 0.95 confidence (some integration risk remains)
    • 95-99% pass → 0.85 confidence
    • 90-94% pass → 0.75 confidence
    • 80-89% pass → 0.65 confidence
    • 70-79% pass → 0.55 confidence
    • <70% pass → 0.40 or lower

Note: Confidence also considers:

  • Integration with other subsystems
  • ROM C parity validation
  • Edge case coverage
  • Code completeness (not just tests passing)