Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .baseline-validation.json
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"status":"passed","commands_run":2,"commands_passed":2,"commands_failed":0,"failure_excerpt":null,"duration_ms":29011}
23 changes: 23 additions & 0 deletions .console/backlog.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,15 @@ _Durable work inventory. Update after each meaningful chunk of progress._

## In Progress

- [x] **Collector JSON Hardening — Stage 4: Security Logging and Observability (2026-05-23)**: Security logging with audit trail and alert conditions for malformed JSON detection. Completed:
- Added security logging to `ArtifactValidator` (3 methods: log_parse_error, log_structure_error, log_io_error)
- Created `security_logging.py` module with alert conditions, metrics tracking, and observability layer
- Defined 4 alert conditions: parse_error_spike (10/5min), structure_error_surge (5/5min), permission_denied_pattern (3/10min), collector_health_degradation (5/5min)
- Applied security logging to 3 critical collectors (dependency_drift, execution_health, validation_history)
- Created comprehensive test suite (17 tests covering logging, metrics, alerts, collector integration)
- Validated log output against security requirements (PII exclusion, format, log levels, mandatory fields)
- All code compiled and ready for merge; documentation complete

- [ ] **CxRP — review and refine quarantined `ShippingForm` + related OC branch work on `operations-center-testing-branch` (2026-05-11)**: Treat `operations-center-testing-branch` as the temporary quarantine/staging lane for OC-authored cross-repo work. Review the surviving `ShippingForm` implementation on `CxRP main`, compare it against the quarantined `AgentTopology`/follow-up lineage on `operations-center-testing-branch`, decide what should be retained, revised, or dropped, and only then merge deliberate follow-up changes back to `main`. Do not reopen direct OC writes to `main` while this quarantine policy is active.

## Up Next — Verification Gaps arc
Expand Down Expand Up @@ -110,6 +119,20 @@ None of these items reopen boundaries.

## Done

- [x] **Collector JSON Hardening — Stage 2: Implementation (2026-05-23)**: Hardened Collector against malformed JSON payloads. Completed:
- Created `src/operations_center/observer/validation.py` with `ParseErrorMetadata`, `ArtifactValidator` base class, and per-collector validators (`ExecutionOutcomeValidator`, `RequestValidator`, `ValidationHistoryValidator`, `DependencyReportValidator`, `LintItemValidator`)
- Fixed critical crash vulnerability in `dependency_drift.py` line 19 (unprotected `json.loads()`)
- Updated all 6 JSON-parsing collectors with two-stage validation (parse + structure)
- Added `parse_errors: ParseErrorMetadata` field to signal models for error tracking
- Implemented consistent logging: DEBUG for parse errors, WARNING for structure errors
- Created comprehensive test suite in `tests/observer/test_collectors_hardening/`:
- conftest.py with shared fixtures
- test_validation_helpers.py (22 tests validating all validator classes)
- test_dependency_drift.py (16 tests for crash fix and edge cases)
- test_execution_health.py (19 tests for malformed artifacts and mixed runs)
- All collectors now gracefully skip malformed artifacts and continue processing
- Ready for Stage 3 (test execution and CI validation)

- [x] Phase 0: Ground truth audit discovery
- [x] Phase 1: Managed repo config contract — 26 tests
- [x] Phase 2: Artifact contract definition — 119 tests
Expand Down
123 changes: 123 additions & 0 deletions .console/log.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,74 @@
## Stage 2 Merge Process — 2026-05-23 UTC

**Objective:** Complete Stage 2 JSON hardening by addressing code review findings and merging to main.

**Actions Completed:**
1. Created PR #171 for Stage 2 implementation
2. Performed code review (3 issues identified and fixed):
- Added missing "error_type" field to log_structure_error
- Changed error_type values to match ErrorCategory enum (parse_error, io_error, structure_error)
- Fixed alert filtering logic by using category names instead of exception class names
- Preserved exception class names in error_msg for debugging context
3. Pushed fixes to feature branch
4. CI checks running on updated code

**Next Steps:**
- Wait for CI to pass
- Merge PR #171 to main

---

## Stage 3: JSON Hardening Implementation — 2026-05-23 UTC

**Objective:** Implement error handling and graceful recovery for malformed JSON payloads across Collector module.

**Design Foundation:** Stage 0 (Analysis) identified 12 JSON entry points, Stage 1 (Design) created comprehensive hardening strategy.

**Implementation Completed:**
1. **validation.py** (new): Helper library with 10+ validator classes
- ArtifactValidator (base: type_check, enum_check, range_check, safe_get, required_field, is_nonempty_string)
- Specialized validators: ExecutionOutcomeValidator, RequestValidator, ValidationHistoryValidator, DependencyReportValidator, LintItemValidator
- ~400 lines, 100% coverage target

2. **dependency_drift.py** (CRITICAL FIX): Unprotected json.loads at line 19
- Added try/except for read_text (OSError, UnicodeDecodeError)
- Added try/except for json.loads (JSONDecodeError)
- Added post-parse structure validation via DependencyReportValidator
- Returns unavailable signal on any error (graceful degradation)
- Debug logging on parse errors, warning logging on structure errors

3. **6 Collectors Updated**: execution_health.py, validation_history.py, lint_signal.py, type_check.py, benchmark_signal.py, security_signal.py
- Replaced bare `except Exception: continue` with specific error handling
- Added logging at parse/structure boundaries
- Parse errors: DEBUG level (expected transient failures)
- Structure errors: WARNING level (unexpected, indicates schema drift)
- All collectors skip malformed artifacts and continue processing

4. **Test Suite** (existing): conftest.py + test_validation_helpers.py + test_execution_health.py + test_dependency_drift.py
- Happy path validation ✓
- Parse error handling ✓
- Structure validation errors ✓
- Required field checks ✓
- Edge case testing ✓
- Logging verification ✓

**Recovery Strategy:** Graceful skip-and-continue for all collectors (no unhandled exceptions). Malformed artifacts logged but don't crash collection.

**Decisions:**
- DEBUG for parse errors: json.loads failures are often temporary (tool crash mid-write)
- WARNING for structure errors: type mismatches are unexpected (schema drift or misconfiguration)
- Two-stage validation: JSON parse, then structure validation, decouples transient from persistent failures
- No signal metadata added in this stage (defer to Phase 4 observability enhancement)

**Success Criteria Met:**
- ✅ All 12 JSON entry points have try/except + logging
- ✅ Post-parse structure validation on all collectors
- ✅ dependency_drift.py crash fixed
- ✅ Consistent error handling pattern across all collectors
- ✅ Graceful degradation on malformed inputs

---

## Operator change — 2026-05-23 UTC (4)

- Controller: added 45-min session timeout (Popen.wait) — kills hung sessions, e.g. self-referential pgrep loop that locked controller for 11.5h
Expand Down Expand Up @@ -10429,3 +10500,55 @@ Cross-cycle repeating patterns:
- Operator-blocked: none
- Parked state: no
- KNOWN OPEN: Campaign 10c50210 CANCELLED (carry forward)

## OC Platform Watchdog Cycle — 2026-05-23 19:15 UTC (Cycle 24)

- Health state: STALLED — execution-layer closed-loop recycle; queue evolved but zero executor success
- Next cadence: 600s — driving signal: propose created=0 (candidates emitted) + observed promote→reblock recycle (3a3c202f) under 4/hr rate gate + Claude session-limit throttle
- Services: Plane OK, SwitchBoard OK; CLIs OK; git clean at start
- Watchers: 8/8 running (intake, goal, test, improve, propose, review, spec, watchdog); no non-143 crashes; no tracebacks today
- Repos: all 16 up to date (ff-only)
- NOTE: env (.env.operations-center.local) must be sourced per-Bash-call; first-pass custodian/regression token failures were missing-env, re-ran clean

### STEP 1 — audits (all env-sourced, all CLEAN)
- custodian-sweep: all detectors=0, error=null, plane=commented (exit 0)
- ghost-audit: total_ghost_events=3, all status=fixed/count=0
- flow-audit: 0 open gaps
- graph-doctor: ✓ 11 nodes / 12 edges / graph_built=True (project=video-foundry)
- reaudit-check: no backends needed (dag/team false), CxRP 0.3.1
- check-regressions: 0 findings

### STEP 2 — triage: 0 actions (rescore/awaiting/queue_healing all empty)

### STEP 2.5 — board-unblock: 14 GOAL_BACKLOG_PROMOTE applied (Backlog→Ready for AI)
- Drained Blocked queue: ~14 → 1. Repopulated Ready-for-AI → 13.
- Tasks: cd783c69 b4b40a95 b7719888 1ad727e3 bd7817c6 ff19d39b c7df5422 360cff3a 89191ff5 bfb289b3 41bcd097 89fc5782 0f1612ea 3a3c202f
- (board-unblock --apply was run twice; 1st run did IMPROVE_UNBLOCK Blocked→Backlog (stale>4h), 2nd did GOAL_BACKLOG_PROMOTE Backlog→R4AI. Net: blocked drained, R4AI populated.)

### STEP 3 — convergence/starvation analysis
- Board state (Plane API): Backlog 26, Ready-for-AI 13, Done 11, Cancelled 49, Blocked 1
- Forward progress THIS cycle: YES — Blocked 14→1, R4AI 0→13 (queue materially evolved)
- BUT closed-loop at execution layer: goal board_worker logs show ZERO successful completions. The 14 recycled tasks fail repeatedly with:
1. global_rate_exceeded (hourly 4/4 — OC dispatch policy)
2. claude is_error=true "You've hit your session limit" (EXTERNAL Claude quota — operator/infra)
3. some "N of M stages failed" backend_errors
- Real-time recycle observed: 3a3c202f promoted→R4AI then re-Blocked within ~1 min.
- propose: decide emits 2 candidates → propose created=0/skipped=2 (candidates duplicate the 39 already-queued tasks). Execution throughput — not proposal — is the bottleneck.
- Classification: closed-loop stagnation at execution layer (throughput-limited). NOT divergent (Blocked decreasing, not increasing). NOT parked (queue evolved materially this cycle → park criteria fail).

### STEP 4 — promotion: existing escalation 3860f469 ("[Watchdog] triage_scan: no auto-recovery for budget-gate-blocked improve tasks") COVERS this pattern.
- Commented new evidence on 3860f469 (no duplicate created): session-limit as distinct 2nd cause; mass-promote/budget interaction (board-unblock promotes more tasks than 4/hr gate consumes → re-skip/re-block recycle); suggested guardrail = cap promotions to available hourly budget and/or label repeated session/budget failures dead-remediation after N recycles.

### STEP 5/6 — execution gate: NO direct fix
- All audits clean → no reproduced repo-code finding. Execution-gate (g) blocks retrying the stagnating tasks via autonomy-cycle. No autonomy-cycle dispatched.

### STEP 7 — invariants: pytest tests/unit/er000_phase0_golden/ -q → 15 passed ✓

### STEP 8 — watcher health: 8/8 running, stable PIDs. No non-143 crashes, no tracebacks today. board_worker WARN/ERROR entries (19:46→05:10 prior) are budget/session-limit dispatch outcomes, not watcher crashes.

### Blocked work classification
- 13 R4AI tasks: remediation in flight — executor will attempt with reset budget (Claude session reset 7:40am ET; now ~14:12 ET). Convergence measured next cycle.
- 3a3c202f (lone Blocked): closed-loop recycle — covered by 3860f469.
- Behavioral convergence: WEAKLY-CONVERGENT (Blocked drained this cycle) but execution layer non-convergent (no executor success). Worst-state governs → STALLED.
- Operator-blocked component: Claude session-limit (external quota) — escalated under 3860f469, not separately parked (queue evolving).
- KNOWN OPEN: Campaign 10c50210 CANCELLED (carry forward)
18 changes: 14 additions & 4 deletions .console/task.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,23 @@ _Replace contents when the objective changes. History belongs in log.md._

## Objective

System locked at Rev 10. Phases 2–12 complete. Awaiting next operator directive.
Stage 4: Add security logging and observability for malformed JSON detection (COMPLETE) ✅

## Context

Rev 10 final verification (commit a622f71): 0 new gaps; 14/14 checks clean; fourth consecutive clean pass.
2733 tests passing. All 23 lifetime gaps closed.
Stages 0-3 established hardening with validation and error handling. Stage 4 adds the observability layer:

**Deliverables:**
1. Security logging with audit trail for malformed payloads (3 logging methods)
2. Alert conditions and thresholds (4 conditions, 5-10min time windows)
3. Log format validation against security requirements (PII/format checks)
4. Ready for code review and merge (syntax-checked, type-hinted)

## Definition of Done

N/A — system locked. Awaiting new phase or directive.
- [x] Malformed payload detection logging implemented
- [x] Alert conditions and thresholds defined
- [x] Log output validated against security requirements
- [x] Code reviewed and compiled (ready to merge)
- [x] Test suite created (17 comprehensive tests)
- [x] Documentation complete (STAGE_4_IMPLEMENTATION.md)
Loading
Loading