feat(observability): wire LLM + report worker metrics, auto-disable jobs#628
Conversation
Add Prometheus instruments and wiring for the full LLM resilience chain and report worker pipeline (PR 6 of llm-resilience-refactor-plan). Metrics added: - tradingagent_llm_retry_total (counter, by provider) - tradingagent_llm_budget_exhausted_total (counter) - tradingagent_report_worker_success_total (counter, by strategy_id) - tradingagent_report_worker_error_total (counter, by strategy_id) - tradingagent_report_staleness_seconds (histogram, by strategy_id) LLM package: - RetryMetrics interface + WithRetryMetrics() on RetryProvider - BudgetMetrics interface + WithBudgetMetrics() on BudgetGuardProvider - WithChainRetryMetrics/WithChainBudgetMetrics chain options - Emit retry metric on each backoff attempt - Emit budget exhaustion metric before rejecting call Runtime: - retryMetricsAdapter bridges provider label to RetryMetrics interface - chainOpts wires retry + budget metrics; report metrics wired on orch Automation: - ReportWorkerMetrics interface on JobOrchestrator - WithReportMetrics() setter (mirrors existing WithJobMetrics pattern) - Emit success/error per strategy in paperValidationReport loop - Auto-disable: jobs with >=5 consecutive failures set Enabled=false and log at ERROR level in both wrapAndRun and runDirect paths API: - ReportMetrics interface defined in report_handlers.go - ObserveReportStaleness emitted on GET /reports/latest - Deps.ReportMetrics wired through to Server Docs: - docs/reference/llm-resilience.md: chain diagram, env var table, emergency overrides, report worker ops, Prometheus metrics reference, troubleshooting playbooks (all providers down / budget exhausted / report stale / fallback storm), 7-day success criteria
There was a problem hiding this comment.
Pull request overview
This PR (6/6 of the LLM resilience refactor plan) wires Prometheus observability into the LLM provider chain and report worker pipeline, adds job auto-disable behavior after repeated failures, and ships an operational reference document.
Changes:
- Adds new Prometheus instruments + recording helpers for LLM retry/budget exhaustion, report worker success/error, and report staleness.
- Extends LLM chain layers (retry + budget guard) to emit metrics via new chain options and provider hooks.
- Adds report-worker metrics emission and auto-disable logic (≥5 consecutive failures) in the automation orchestrator; emits report staleness from the “latest report” API endpoint; wires metrics through runtime.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/metrics/metrics.go | Registers new Prometheus instruments and exposes convenience record/observe methods. |
| internal/llm/retry.go | Adds optional retry metrics hook and emits on each backoff attempt. |
| internal/llm/provider_chain.go | Adds chain options to wire retry/budget metrics into their respective layers. |
| internal/llm/budget.go | Adds optional budget exhaustion metrics hook before rejecting calls. |
| internal/automation/orchestrator.go | Adds report-worker metrics plumbing and auto-disable logic after consecutive failures. |
| internal/automation/jobs_reports.go | Emits per-strategy report worker success/error counters. |
| internal/api/server.go | Wires ReportMetrics dependency into the API server. |
| internal/api/report_handlers.go | Observes report staleness on GET .../reports/latest. |
| cmd/tradingagent/runtime.go | Wires metrics implementations into the LLM chain and report worker orchestrator. |
| docs/reference/llm-resilience.md | Adds operational reference documentation for chain behavior, metrics, and playbooks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Complete checks budget before delegating. Records usage after success. | ||
| func (b *BudgetGuardProvider) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error) { | ||
| if !b.budget.reserveRequest() { | ||
| if b.metrics != nil { | ||
| b.metrics.RecordLLMBudgetExhausted() | ||
| } | ||
| return nil, ErrBudgetExhausted | ||
| } |
There was a problem hiding this comment.
Budget exhaustion metrics emission was added (RecordLLMBudgetExhausted() before returning ErrBudgetExhausted), but budget_test.go doesn’t cover this new behavior. Adding a test with a stub BudgetMetrics to assert it records exactly when reserveRequest() fails would help prevent silent regressions in observability.
| // Layer 3: retry | ||
| if cfg.maxAttempts > 1 { | ||
| retryOpts := []RetryOption{WithMaxAttempts(cfg.maxAttempts)} | ||
| if cfg.baseDelay > 0 { | ||
| retryOpts = append(retryOpts, WithBaseDelay(cfg.baseDelay)) | ||
| } | ||
| p = NewRetryProvider(p, logger, retryOpts...) | ||
| rp := NewRetryProvider(p, logger, retryOpts...) | ||
| if cfg.metrics.retry != nil { | ||
| rp = rp.WithRetryMetrics(cfg.metrics.retry) | ||
| } | ||
| p = rp | ||
| } | ||
|
|
||
| // Layer 4: throttle | ||
| if cfg.throttle > 0 { | ||
| p = NewThrottledProvider(p, cfg.throttle) | ||
| } | ||
|
|
||
| // Layer 5: per-call timeout | ||
| if cfg.callTimeout > 0 { | ||
| p = &timeoutProvider{inner: p, timeout: cfg.callTimeout} | ||
| } | ||
|
|
||
| // Layer 6 (outermost): budget guard | ||
| if cfg.budget != nil { | ||
| p = NewBudgetGuardProvider(p, cfg.budget) | ||
| bg := NewBudgetGuardProvider(p, cfg.budget) | ||
| if cfg.metrics.budget != nil { | ||
| bg = bg.WithBudgetMetrics(cfg.metrics.budget) | ||
| } | ||
| p = bg |
There was a problem hiding this comment.
New chain options (WithChainRetryMetrics, WithChainBudgetMetrics) are added and wired, but provider_chain_test.go doesn’t validate that these metrics are actually attached/invoked when the retry/budget layers are enabled. Consider adding tests that build a chain with these options and assert the corresponding stub metrics are called.
| if s.reportMetrics != nil { | ||
| s.reportMetrics.ObserveReportStaleness(id.String(), stale) | ||
| } |
There was a problem hiding this comment.
handleGetLatestReport now emits ObserveReportStaleness, but report_handlers_test.go doesn’t assert that the metrics hook is called (or that the staleness value passed matches the response). Adding a test with a stub ReportMetrics implementation would help ensure this observation doesn’t get accidentally removed or miscomputed.
| if cfg.RetryMaxAttempts > 1 { | ||
| opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts)) | ||
| if appMetrics != nil { | ||
| opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)})) |
There was a problem hiding this comment.
WithChainRetryMetrics is wired using a retryMetricsAdapter that binds the label to cfg.DefaultProvider, but the retry layer wraps the fallback layer (retry → fallback). This means retries triggered after the chain has fallen back to the secondary provider will still be recorded under the primary provider label, which makes tradingagent_llm_retry_total{provider=...} misleading. Consider either (a) changing the metric/label to reflect “configured primary” rather than actual provider, or (b) restructuring the chain so retry is applied per-provider (e.g., wrap primary+secondary with retry before composing fallback) so the label matches reality.
| if cfg.RetryMaxAttempts > 1 { | |
| opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts)) | |
| if appMetrics != nil { | |
| opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)})) | |
| // | |
| // The retry metric label is intentionally bound to the configured primary | |
| // provider, not the actual provider that may ultimately handle the call after | |
| // fallback. This keeps the label semantics explicit and avoids attributing | |
| // secondary-provider retries to the wrong runtime provider identity. | |
| if cfg.RetryMaxAttempts > 1 { | |
| opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts)) | |
| if appMetrics != nil { | |
| opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{ | |
| m: appMetrics, | |
| provider: fmt.Sprintf("configured_primary:%s", strings.TrimSpace(cfg.DefaultProvider)), | |
| })) |
| // ReportWorkerMetrics captures report worker success/error/staleness. | ||
| type ReportWorkerMetrics interface { | ||
| RecordReportWorkerSuccess(strategyID string) | ||
| RecordReportWorkerError(strategyID string) | ||
| ObserveReportStaleness(strategyID string, seconds float64) |
There was a problem hiding this comment.
ReportWorkerMetrics includes ObserveReportStaleness, but the automation package only uses reportMetrics for report-worker success/error emission (staleness is observed in the API handler). Keeping an unused method in this interface forces any alternative implementation to add an unrelated method. Consider splitting this into a narrower report-worker interface (success/error only) and keeping staleness observation on the API-side ReportMetrics interface.
| // ReportWorkerMetrics captures report worker success/error/staleness. | |
| type ReportWorkerMetrics interface { | |
| RecordReportWorkerSuccess(strategyID string) | |
| RecordReportWorkerError(strategyID string) | |
| ObserveReportStaleness(strategyID string, seconds float64) | |
| // ReportWorkerMetrics captures report worker success/error emission. | |
| type ReportWorkerMetrics interface { | |
| RecordReportWorkerSuccess(strategyID string) | |
| RecordReportWorkerError(strategyID string) |
| if job.ConsecutiveFailures >= autoDisableThreshold { | ||
| job.Enabled = false | ||
| o.logger.Error("automation: auto-disabled job after consecutive failures", | ||
| slog.String("job", job.Name), | ||
| slog.Int("consecutive_failures", job.ConsecutiveFailures), | ||
| ) | ||
| } |
There was a problem hiding this comment.
Auto-disable behavior is newly introduced here (disable job after autoDisableThreshold consecutive failures), but there’s no unit test asserting that Enabled flips to false once the threshold is reached (and stays disabled for scheduled runs). Adding a test in orchestrator_test.go for both the manual (runDirect) and scheduled (wrapAndRun) paths would prevent regressions.
| // RecordLLMRetry increments the retry counter for a given provider. | ||
| func (m *Metrics) RecordLLMRetry(provider string) { | ||
| if m == nil { | ||
| return | ||
| } | ||
| m.LLMRetryTotal.WithLabelValues(provider).Inc() | ||
| } | ||
|
|
||
| // RecordLLMBudgetExhausted increments the budget exhaustion counter. | ||
| func (m *Metrics) RecordLLMBudgetExhausted() { | ||
| if m == nil { | ||
| return | ||
| } | ||
| m.LLMBudgetExhaustedTotal.Inc() | ||
| } | ||
|
|
||
| // RecordReportWorkerSuccess increments the report success counter for a strategy. | ||
| func (m *Metrics) RecordReportWorkerSuccess(strategyID string) { | ||
| if m == nil { | ||
| return | ||
| } | ||
| m.ReportWorkerSuccessTotal.WithLabelValues(strategyID).Inc() | ||
| } | ||
|
|
||
| // RecordReportWorkerError increments the report error counter for a strategy. | ||
| func (m *Metrics) RecordReportWorkerError(strategyID string) { | ||
| if m == nil { | ||
| return | ||
| } | ||
| m.ReportWorkerErrorTotal.WithLabelValues(strategyID).Inc() | ||
| } | ||
|
|
||
| // ObserveReportStaleness records how stale a report is at query time. | ||
| func (m *Metrics) ObserveReportStaleness(strategyID string, seconds float64) { | ||
| if m == nil { | ||
| return | ||
| } | ||
| m.ReportStaleness.WithLabelValues(strategyID).Observe(seconds) | ||
| } |
There was a problem hiding this comment.
New metric instruments and convenience methods were added (RecordLLMRetry, RecordLLMBudgetExhausted, RecordReportWorkerSuccess/Error, ObserveReportStaleness), but internal/metrics/metrics_test.go doesn’t exercise them (nor validate the emitted series). Adding coverage similar to TestNewCounters / TestConvenienceMethods would catch naming/registration/label regressions.
| if attempt > 0 { | ||
| delay := r.backoffDelay(attempt - 1) | ||
|
|
||
| if r.metrics != nil { | ||
| r.metrics.RecordLLMRetry() | ||
| } |
There was a problem hiding this comment.
Retry metrics emission was added (r.metrics.RecordLLMRetry() on attempts > 0), but retry_test.go doesn’t currently assert that retries increment/record metrics. Consider adding a small stub RetryMetrics implementation in tests and verifying it’s called exactly once per backoff attempt when a retryable error occurs.
|
@copilot apply changes based on the comments in this thread |
Agent-Logs-Url: https://github.com/PatrickFanella/augr/sessions/3c3eb647-082d-4ab2-8e2d-72f44dbd924d Co-authored-by: PatrickFanella <61631520+PatrickFanella@users.noreply.github.com>
Implemented the review-thread follow-ups in commit |
Summary
PR 6 of 6 from
docs/design/llm-resilience-refactor-plan.md.Wires Prometheus observability into the LLM provider chain and report worker pipeline, adds auto-disable logic for misbehaving jobs, and adds an operational reference doc.
New Metrics
tradingagent_llm_retry_totaltradingagent_llm_budget_exhausted_totaltradingagent_report_worker_success_totaltradingagent_report_worker_error_totaltradingagent_report_staleness_secondsChanges
internal/llmRetryMetricsinterface +WithRetryMetrics()onRetryProvider— emits on each backoff attemptBudgetMetricsinterface +WithBudgetMetrics()onBudgetGuardProvider— emits before rejecting exhausted callsWithChainRetryMetrics/WithChainBudgetMetricschain options inprovider_chain.gointernal/metricsRecordLLMRetry,RecordLLMBudgetExhausted,RecordReportWorkerSuccess,RecordReportWorkerError,ObserveReportStaleness*Metricssatisfiesllm.BudgetMetrics,api.ReportMetrics, andautomation.ReportWorkerMetricsinterfacescmd/tradingagent/runtimeretryMetricsAdapterbridges provider label to the no-argRetryMetricsinterfacechainOptswires retry + budget metrics into the chainWithReportMetrics(appMetrics)called alongsideWithJobMetricsinternal/automationReportWorkerMetricsinterface onJobOrchestrator+WithReportMetrics()setterpaperValidationReportloopEnabledset tofalseand log at ERROR. Applied in bothwrapAndRunandrunDirect. Re-enable viaPUT /api/v1/automation/jobs/{name}/enable.internal/apiReportMetricsinterface inreport_handlers.goObserveReportStalenessemitted onGET /api/v1/strategies/{id}/reports/latestDeps.ReportMetricswired through toServerdocs/reference/llm-resilience.md(new)Verification