feat(monitoring): wire HealthCollector state-change events to workflow trigger + notification pipeline by mrveiss · Pull Request #3415 · mrveiss/AutoBot-AI

mrveiss · 2026-04-03T20:33:05Z

Summary

health_collector.py: tracks per-service last-known state; publishes {"service", "prev_state", "new_state", "error_context"} to autobot:services:{name}:state_change on real state transitions only (first-observation and no-change are suppressed); Redis failures are caught and logged, never propagated
notification_service.py: added SERVICE_FAILED = "service_failure" to NotificationEvent enum with default template
workflow_templates/service_health_monitor.yaml: new template — REDIS_PUBSUB trigger on autobot:services:*:state_change, filters on failed/crash-loop, sends notification
docs/examples/service_failure_monitoring.py: runnable standalone demo using redis.pubsub().psubscribe() + NotificationService
docs/user/guides/workflows.md: added "Monitor a Linux Service" section
Co-located unit tests: 7 cases covering first-observation suppression, no-change suppression, transitions, payload contents, Redis-unavailable resilience
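
The suppression rules listed in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in health_collector.py: `StateTracker`, `observe`, and the injected `publish` callable are hypothetical names, and only the channel format and payload keys are taken from the summary.

```python
import json
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)

# Channel format taken from the PR summary.
CHANNEL_TEMPLATE = "autobot:services:{service}:state_change"


class StateTracker:
    """Hypothetical stand-in for the per-service tracking described above."""

    def __init__(self, publish: Callable[[str, str], None]) -> None:
        self._publish = publish  # assumed: something like a Redis client's publish
        self._last_known: Dict[str, str] = {}

    def observe(self, service: str, state: str, error_context: str = "") -> bool:
        """Record a state; return True only if a transition event was published."""
        prev = self._last_known.get(service)
        self._last_known[service] = state
        if prev is None or prev == state:
            return False  # first observation or no change: suppressed
        payload = json.dumps({
            "service": service,
            "prev_state": prev,
            "new_state": state,
            "error_context": error_context,
        })
        try:
            self._publish(CHANNEL_TEMPLATE.format(service=service), payload)
        except Exception as exc:  # publisher failures are logged, never propagated
            logger.warning("state-change publish failed for %s: %s", service, exc)
            return False
        return True
```

Injecting the publisher keeps the sketch runnable without a Redis server; a test can pass a list's `append`-style callable and inspect what would have been sent.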

Closes #3404

…w trigger + notification pipeline (#3404) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mrveiss · 2026-04-03T20:39:53Z

Code review

Found 2 issues.

State-change detection runs on a partial service list after timeout or discovery exceptions (correctness bug)

In discover_all_services, the call to _detect_and_publish_state_changes(services) is placed after the except block, so it executes even when subprocess.TimeoutExpired or a generic Exception interrupts the systemctl parse loop. When a timeout fires mid-parse, services contains only the units processed before the cutoff. Those units have their _last_known_status updated and may fire spurious transition events; more importantly, units that were observed in the previous cycle but are absent from the truncated list are silently ignored rather than treated as "not seen", so _last_known_status drifts from reality on every timeout. On a node where systemctl list-units is slow (common under load), this will happen frequently.

AutoBot-AI/autobot-slm-backend/slm/agent/health_collector.py

Lines 154 to 165 in b92a9f6

    
                            self._get_service_details(service_info["name"])
                        )
                        error_ctx = self._get_error_context(service_info["name"])
                        if error_ctx:
                            service_info["error_message"] = error_ctx
                    services.append(service_info)
    except subprocess.TimeoutExpired:
        logger.warning("Timeout discovering services")
    except FileNotFoundError:
        logger.warning("systemctl not found - not a systemd system")
    except Exception as e:

NotificationEvent.SERVICE_FAILED member name is inconsistent with its string value (API contract mismatch)

Every other enum member has a name that matches its value root: WORKFLOW_FAILED = "workflow_failed", STEP_FAILED = "step_failed", etc. The new member breaks this convention: it is named SERVICE_FAILED but has value "service_failure". Code that round-trips through NotificationEvent("service_failure") gets back a member whose .name is SERVICE_FAILED — a mismatch that will surprise anyone switching on .name vs .value, and makes the YAML template (event: service_failure) inconsistent with the Python symbol. The member should be either SERVICE_FAILED = "service_failed" or SERVICE_FAILURE = "service_failure".

AutoBot-AI/autobot-backend/services/notification_service.py

Lines 73 to 74 in b92a9f6

	SERVICE_FAILED = "service_failure"
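
The mismatch described above can be reproduced with a reduced enum. This is a sketch only: the class below is a stand-in containing just the members quoted in the review, and `mismatched_members` is a hypothetical helper that treats "name matches value" as a plain lower-case comparison.

```python
from enum import Enum


class NotificationEvent(Enum):
    """Reduced stand-in; only the members quoted in the review."""
    WORKFLOW_FAILED = "workflow_failed"
    STEP_FAILED = "step_failed"
    SERVICE_FAILED = "service_failure"  # value as originally submitted


def mismatched_members(enum_cls):
    """Return names of members whose lower-cased name differs from their value."""
    return [m.name for m in enum_cls if m.name.lower() != m.value]


event = NotificationEvent("service_failure")  # lookup by value
print(event.name)                             # SERVICE_FAILED
print(mismatched_members(NotificationEvent))  # ['SERVICE_FAILED']
```

A check along these lines could run as a unit test to catch future members that drift from the convention.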

Generated with Claude Code


…ist; fix SERVICE_FAILED enum value
- Early-return on TimeoutExpired/FileNotFoundError/Exception before calling _detect_and_publish_state_changes to prevent false transitions on truncated lists
- SERVICE_FAILED value corrected from "service_failure" to "service_failed" to match name/value convention (WORKFLOW_FAILED="workflow_failed", STEP_FAILED="step_failed")
- Updated test assertion and workflow template event key to match
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-03T20:56:34Z

✅ SSOT Configuration Compliance: Passing

🎉 No hardcoded values detected that have SSOT config equivalents!

mrveiss · 2026-04-03T21:04:20Z

Code review

Found 1 issue (both issues from the prior review are already fixed in the latest commit).

Fixed since the prior review:

SERVICE_FAILED = "service_failure" → SERVICE_FAILED = "service_failed" (value now matches name root — consistent with all other members).
Partial-list guard: each exception path in discover_all_services now does return services before _detect_and_publish_state_changes is called — spurious events on timeout/parse-error are prevented.
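
The shape of the partial-list guard can be sketched as below. This is an illustration under assumptions, not the code in discover_all_services: `discover_and_detect`, `list_units`, and `detect` are hypothetical names standing in for the systemctl parse loop and _detect_and_publish_state_changes.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

Unit = Dict[str, str]


def discover_and_detect(
    list_units: Callable[[], Iterable[Unit]],
    detect: Callable[[List[Unit]], None],
) -> List[Unit]:
    """Run state-change detection only when discovery finished cleanly."""
    services: List[Unit] = []
    try:
        for unit in list_units():
            services.append(unit)
    except subprocess.TimeoutExpired:
        return services  # partial list: return early, skip detection
    except FileNotFoundError:
        return services  # no systemctl: nothing reliable to compare
    except Exception:
        return services  # any other discovery error: skip detection
    detect(services)  # reached only with a complete list
    return services
```

Returning from each exception path means the last-known state is never updated from a truncated list, which is the drift the first review described.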

Remaining issue — TestPublishStateChange patch target does not exist in module namespace (correctness bug, score 90)

_publish_state_change imports get_redis_client with a deferred local import inside the method body:

# health_collector.py — inside _publish_state_change
try:
    from autobot_shared.redis_client import get_redis_client
    client = get_redis_client(database="main")

get_redis_client is never bound into the slm.agent.health_collector module namespace. Every test in TestPublishStateChange patches "slm.agent.health_collector.get_redis_client", but because that name does not exist at module level, unittest.mock.patch raises AttributeError at context-manager entry — the entire test class errors out in CI.

Fix: move the import to module level (preferred — consistent with autobot_shared usage everywhere else), or patch "autobot_shared.redis_client.get_redis_client" if the deferred import is intentional (e.g. to tolerate the package being absent on the SLM host).
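
The patch-target behaviour can be reproduced without the AutoBot packages. In the sketch below, `demo_redis_client` and `demo_health_collector` are hypothetical stand-in modules registered in `sys.modules`; they only mimic the deferred-import layout described above.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# Stand-in for the module that defines get_redis_client.
source = types.ModuleType("demo_redis_client")
source.get_redis_client = lambda database="main": "real-client"
sys.modules["demo_redis_client"] = source

# Stand-in for the module that uses it via a deferred local import.
consumer = types.ModuleType("demo_health_collector")
sys.modules["demo_health_collector"] = consumer


def publish():
    from demo_redis_client import get_redis_client  # bound locally, per call
    return get_redis_client(database="main")


consumer.publish = publish


def patch_consumer_fails() -> bool:
    """Patching the consumer module fails: the name is not a module attribute."""
    try:
        with patch("demo_health_collector.get_redis_client"):
            pass
    except AttributeError:
        return True
    return False


def patch_source_works() -> bool:
    """Patching the defining module is seen by the deferred import."""
    fake = MagicMock()
    with patch("demo_redis_client.get_redis_client", return_value=fake):
        return consumer.publish() is fake
```

Because the deferred import looks the name up on the defining module at call time, patching that module takes effect; a module-level import would instead require patching the consumer's namespace.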

AutoBot-AI/autobot-slm-backend/slm/agent/health_collector.py

Lines 317 to 319 in dc02d70

    
    from autobot_shared.redis_client import get_redis_client
    client = get_redis_client(database="main")

AutoBot-AI/autobot-slm-backend/slm/agent/health_collector_state_change_test.py

Lines 113 to 121 in dc02d70

    
    def test_publishes_to_correct_channel(self):
        collector = _make_collector()
        mock_redis = self._mock_redis()
        with patch(
            "slm.agent.health_collector.get_redis_client", return_value=mock_redis
        ):
            collector._publish_state_change("nginx", "running", "failed", "")
        expected_channel = _STATE_CHANGE_CHANNEL_TEMPLATE.format(service="nginx")
        call_args = mock_redis.publish.call_args[0]


feat(monitoring): wire HealthCollector state-change events to workflo…

b92a9f6

…w trigger + notification pipeline (#3404) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mrveiss merged commit 34ed0c9 into Dev_new_gui Apr 3, 2026
4 of 5 checks passed

mrveiss deleted the issue-3404 branch April 3, 2026 20:51

This was referenced Apr 3, 2026

bug(slm-agent): get_redis_client deferred import in _publish_state_change breaks test patching #3422

Open

docs: PROMPT_MIDDLEWARE_GUIDE had wrong method names (on_on_ double prefix) — add doc linting to CI #3425

Open

github-actions bot mentioned this pull request Apr 3, 2026

feat(monitoring): wire HealthCollector state-change events to workflow trigger + notification pipeline #3404

Open

6 tasks

mrveiss merged 2 commits into Dev_new_gui from issue-3404
