PR: Implement Conductor State Retention and Garbage Collection #16

Open
0-robert wants to merge 1 commit into Rigos0:main from 0-robert:feat/conductor-gc
PR: Implement Conductor State Retention and Garbage Collection

Overview

This PR implements a robust state retention and garbage collection (GC) system for the Conductor. It prevents the .superturtle/state/ directory from growing indefinitely by pruning stale terminal records and rotating append-only logs after a configurable retention window (default: 7 days).
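
The retention window can be pictured as a cutoff timestamp derived from the `--max-age` value. The following is a minimal sketch, not the PR's code: the function names `parse_max_age` and `cutoff_for` and the accepted unit suffixes are assumptions made for illustration, since the PR only shows the `7d` form.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical unit table; the PR only demonstrates the "7d" form.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_max_age(text: str) -> timedelta:
    """Turn a string such as '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhdw])\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(seconds=int(amount) * _UNIT_SECONDS[unit])


def cutoff_for(max_age: str = "7d", now: Optional[datetime] = None) -> datetime:
    """Records last touched before the returned instant count as stale."""
    now = now or datetime.now(timezone.utc)
    return now - parse_max_age(max_age)
```

Using timezone-aware UTC timestamps for the cutoff avoids ambiguity when record timestamps were written on a machine in a different local time zone.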

Backlog Reference

  • Add conductor state retention/gc so .superturtle/state/ does not grow forever (Line 139 in CLAUDE.md).

Visual Architecture

flowchart TD
    subgraph Trigger["1. Trigger & Configuration"]
        CLI[("./ctl gc-state --max-age 7d")] --> Store["Init ConductorStateStore"]
    end

    subgraph Discovery["2. Safety Guard Discovery"]
        direction LR
        ScanW["Scan /workers/"] --> ActiveSet["Map 'Active Run IDs'<br/>(Lifecycle != archived)"]
        ScanWK["Scan /wakeups/"] --> PendingSet["Map 'Pending Run IDs'<br/>(State == pending|processing)"]
    end

    subgraph Engine["3. Surgical Pruning Engine"]
        direction TB
        subgraph Decisions["Safety Logic Gates"]
            direction LR
            D1{{"Prune Wakeup?"}}
            D2{{"Prune Inbox?"}}
            D3{{"Prune Worker?"}}
        end

        D1 -- "NOT Active" --> P1[("os.unlink(wakeups/)")]
        D2 -- "Age > Cutoff" --> P2[("os.unlink(inbox/)")]
        D3 -- "No Pending" --> P3[("os.unlink(workers/)")]
    end

    subgraph Rotation["4. Atomic Log Rotation"]
        direction LR
        Logs[("events.jsonl<br/>runs.jsonl")] --> Split["Filter Old Lines"]
        Split --> Archive["Append to .1"]
        Split --> Replace["Atomic replace()"]
    end

    Trigger --> Discovery
    Discovery --> Engine
    Engine --> Rotation
    Rotation --> Summary["GcResult Summary"]

    %% Styling
    style Trigger fill:#f9f0ff,stroke:#722ed1,stroke-width:2px
    style Discovery fill:#e6f7ff,stroke:#1890ff,stroke-width:2px
    style Engine fill:#fff7e6,stroke:#fa8c16,stroke-width:2px
    style Rotation fill:#f6ffed,stroke:#52c41a,stroke-width:2px
    style Decisions fill:#ffffff,stroke:#333,stroke-dasharray: 5 5
    style Summary fill:#fff1f0,stroke:#f5222d,stroke-width:4px
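
Stage 4 of the flowchart (filter old lines, append them to `.1`, then atomically replace the live log) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `conductor_gc.py`: the function name `rotate_jsonl`, the `timestamp` field name, and the choice to keep unparseable lines in the live log are all hypothetical.

```python
import contextlib
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path


def rotate_jsonl(path: Path, cutoff: datetime, ts_field: str = "timestamp") -> int:
    """Move lines older than `cutoff` into `<path>.1`; return how many moved."""
    keep, old = [], []
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.endswith("\n"):
                line += "\n"  # normalise a final line without a newline
            try:
                stamp = datetime.fromisoformat(json.loads(line)[ts_field])
                is_old = stamp < cutoff
            except (ValueError, KeyError, TypeError):
                # Malformed JSON, missing field, or incomparable timestamp:
                # assumption here is to leave the line where it is.
                is_old = False
            (old if is_old else keep).append(line)
    if not old:
        return 0

    # 1. Append the stale lines to the archive and force them to disk.
    archive = path.with_name(path.name + ".1")
    with open(archive, "a", encoding="utf-8") as fh:
        fh.writelines(old)
        fh.flush()
        os.fsync(fh.fileno())

    # 2. Write survivors to a temp file in the same directory, then swap it in.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.writelines(keep)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
    return len(old)
```

In this ordering, an interruption between the two steps leaves the live log untouched, at the cost of possibly duplicating the archived lines in `.1` on the next run. Creating the temp file in the log's own directory keeps `os.replace` on a single filesystem.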

Core Implementation

  • Pruning Engine (state/conductor_gc.py): A Python module that unlinks individual stale JSON records under wakeups/, inbox/, and workers/, one file at a time.
  • Safety Guards:
    • Active Run Protection: Never prunes records (even if old) associated with a non-archived run_id.
    • Pending Notification Guard: Worker records are strictly preserved until all pending or processing notifications for that run are delivered.
    • Atomic Log Rotation: Writes to a temporary file and swaps it in with os.replace, so the live log is never left partially written if the process is interrupted.
  • Data Resilience: Skips malformed JSON and non-UTF-8 bytes in log files by reading with errors="replace" and tolerating per-line parse failures.
  • CLI Integration: Added gc-state command to ctl with --dry-run and human-readable durations.
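
The two run-level safety guards listed above can be expressed as set lookups that gate each unlink. This is a minimal sketch assuming each record is a dict; `run_id` comes from the description above, while the `lifecycle`, `state`, and `updated_at` field names and all function names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Mapping, Set


def active_run_ids(workers: Iterable[Mapping]) -> Set[str]:
    """Run IDs whose worker record is not yet archived."""
    return {
        w["run_id"] for w in workers
        if "run_id" in w and w.get("lifecycle") != "archived"
    }


def pending_run_ids(wakeups: Iterable[Mapping]) -> Set[str]:
    """Run IDs that still have an undelivered notification."""
    return {
        w["run_id"] for w in wakeups
        if "run_id" in w and w.get("state") in ("pending", "processing")
    }


def may_prune_wakeup(wakeup: Mapping, cutoff: datetime, active: Set[str]) -> bool:
    # Active Run Protection: age alone is never enough.
    return wakeup["updated_at"] < cutoff and wakeup["run_id"] not in active


def may_prune_worker(
    worker: Mapping, cutoff: datetime, active: Set[str], pending: Set[str]
) -> bool:
    # Pending Notification Guard: keep the worker until its wakeups are delivered.
    run_id = worker["run_id"]
    return (
        worker["updated_at"] < cutoff
        and run_id not in active
        and run_id not in pending
    )
```

Building both sets once, before any file is removed, means every pruning decision is made against the same snapshot of the state directory.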

Testing & Verification

  • Automated (TDD): Added 33 behavioral tests in super_turtle/state/test_conductor_gc.py covering state transitions, safety gates, and timestamp precision.
  • Manual Verification: Completed 10 manual edge-case scenarios (documented in super_turtle/docs/gc/manual_testing.md) including binary log injection and interruption simulation.

New Documentation

  • super_turtle/docs/gc/planning.md: Detailed design and scope.
  • super_turtle/docs/gc/architecture.md: Visual technical reference.
  • super_turtle/docs/gc/manual_testing.md: Manual edge-case verification guide.

- Adds 'ctl gc-state' to prune stale wakeups, inbox items, and archived workers.
- Implements atomic log rotation for 'events.jsonl' and 'runs.jsonl'.
- Includes comprehensive safety guards for active runs and pending notifications.
- Adds 33 TDD behavioral tests and a manual verification guide covering 10 edge-case scenarios.
- Organizes design, architecture (Mermaid), and test docs in 'docs/gc/'.