Skip to content

feat(monitoring): wire HealthCollector state-change events to workflow trigger + notification pipeline#3415

Merged
mrveiss merged 2 commits intoDev_new_guifrom
issue-3404
Apr 3, 2026
Merged

feat(monitoring): wire HealthCollector state-change events to workflow trigger + notification pipeline#3415
mrveiss merged 2 commits intoDev_new_guifrom
issue-3404

Conversation

@mrveiss
Copy link
Copy Markdown
Owner

@mrveiss mrveiss commented Apr 3, 2026

Summary

  • health_collector.py: tracks per-service last-known state; publishes {"service", "prev_state", "new_state", "error_context"} to autobot:services:{name}:state_change on real state transitions only (first-observation and no-change are suppressed); Redis failures are caught and logged, never propagated
  • notification_service.py: added SERVICE_FAILED = "service_failure" to NotificationEvent enum with default template
  • workflow_templates/service_health_monitor.yaml: new template — REDIS_PUBSUB trigger on autobot:services:*:state_change, filters on failed/crash-loop, sends notification
  • docs/examples/service_failure_monitoring.py: runnable standalone demo using redis.pubsub().psubscribe() + NotificationService
  • docs/user/guides/workflows.md: added "Monitor a Linux Service" section
  • Co-located unit tests: 7 cases covering first-observation suppression, no-change suppression, transitions, payload contents, Redis-unavailable resilience

Closes #3404

…w trigger + notification pipeline (#3404)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mrveiss
Copy link
Copy Markdown
Owner Author

mrveiss commented Apr 3, 2026

Code review

Found 2 issues.

  1. State-change detection runs on a partial service list after timeout or discovery exceptions (correctness bug)

In discover_all_services, the call to _detect_and_publish_state_changes(services) is placed after the except block, so it executes even when subprocess.TimeoutExpired or a generic Exception interrupts the systemctl parse loop. When a timeout fires mid-parse, services contains only the units processed before the cutoff. Those units have their _last_known_status updated and may fire spurious transition events; more importantly, units that were observed in the previous cycle but are absent from the truncated list are silently ignored rather than treated as "not seen", so _last_known_status drifts from reality on every timeout. On a node where systemctl list-units is slow (common under load), this will happen frequently.

self._get_service_details(service_info["name"])
)
error_ctx = self._get_error_context(service_info["name"])
if error_ctx:
service_info["error_message"] = error_ctx
services.append(service_info)
except subprocess.TimeoutExpired:
logger.warning("Timeout discovering services")
except FileNotFoundError:
logger.warning("systemctl not found - not a systemd system")
except Exception as e:

  1. NotificationEvent.SERVICE_FAILED member name is inconsistent with its string value (API contract mismatch)

Every other enum member has a name that matches its value root: WORKFLOW_FAILED = "workflow_failed", STEP_FAILED = "step_failed", etc. The new member breaks this convention: it is named SERVICE_FAILED but has value "service_failure". Code that round-trips through NotificationEvent("service_failure") gets back a member whose .name is SERVICE_FAILED — a mismatch that will surprise anyone switching on .name vs .value, and makes the YAML template (event: service_failure) inconsistent with the Python symbol. The member should be either SERVICE_FAILED = "service_failed" or SERVICE_FAILURE = "service_failure".

SERVICE_FAILED = "service_failure"


Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

…ist; fix SERVICE_FAILED enum value

- Early-return on TimeoutExpired/FileNotFoundError/Exception before calling
  _detect_and_publish_state_changes to prevent false transitions on truncated lists
- SERVICE_FAILED value corrected from "service_failure" to "service_failed" to match
  name/value convention (WORKFLOW_FAILED="workflow_failed", STEP_FAILED="step_failed")
- Updated test assertion and workflow template event key to match

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mrveiss mrveiss merged commit 34ed0c9 into Dev_new_gui Apr 3, 2026
4 of 5 checks passed
@mrveiss mrveiss deleted the issue-3404 branch April 3, 2026 20:51
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 3, 2026

✅ SSOT Configuration Compliance: Passing

🎉 No hardcoded values detected that have SSOT config equivalents!

@mrveiss
Copy link
Copy Markdown
Owner Author

mrveiss commented Apr 3, 2026

Code review

Found 1 issue (both issues from the prior review are already fixed in the latest commit).

Fixed since the prior review:

  • SERVICE_FAILED = "service_failure"SERVICE_FAILED = "service_failed" (value now matches name root — consistent with all other members).
  • Partial-list guard: each exception path in discover_all_services now does return services before _detect_and_publish_state_changes is called — spurious events on timeout/parse-error are prevented.

Remaining issue — TestPublishStateChange patch target does not exist in module namespace (correctness bug, score 90)

_publish_state_change imports get_redis_client with a deferred local import inside the method body:

# health_collector.py — inside _publish_state_change
try:
    from autobot_shared.redis_client import get_redis_client
    client = get_redis_client(database="main")

get_redis_client is never bound into the slm.agent.health_collector module namespace. Every test in TestPublishStateChange patches "slm.agent.health_collector.get_redis_client", but because that name does not exist at module level, unittest.mock.patch raises AttributeError at context-manager entry — the entire test class errors out in CI.

Fix: move the import to module level (preferred — consistent with autobot_shared usage everywhere else), or patch "autobot_shared.redis_client.get_redis_client" if the deferred import is intentional (e.g. to tolerate the package being absent on the SLM host).

from autobot_shared.redis_client import get_redis_client
client = get_redis_client(database="main")

def test_publishes_to_correct_channel(self):
collector = _make_collector()
mock_redis = self._mock_redis()
with patch(
"slm.agent.health_collector.get_redis_client", return_value=mock_redis
):
collector._publish_state_change("nginx", "running", "failed", "")
expected_channel = _STATE_CHANGE_CHANNEL_TEMPLATE.format(service="nginx")
call_args = mock_redis.publish.call_args[0]


Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant