Skip to content

feat(observability): wire LLM + report worker metrics, auto-disable jobs#628

Merged
PatrickFanella merged 2 commits into
mainfrom
feat/llm-resilience-observability
Apr 17, 2026
Merged

feat(observability): wire LLM + report worker metrics, auto-disable jobs#628
PatrickFanella merged 2 commits into
mainfrom
feat/llm-resilience-observability

Conversation

@PatrickFanella
Copy link
Copy Markdown
Owner

Summary

PR 6 of 6 from docs/design/llm-resilience-refactor-plan.md.

Wires Prometheus observability into the LLM provider chain and report worker pipeline, adds auto-disable logic for misbehaving jobs, and adds an operational reference doc.


New Metrics

Metric Type Labels Description
tradingagent_llm_retry_total Counter provider Retry attempts by provider
tradingagent_llm_budget_exhausted_total Counter Budget exhaustion rejections
tradingagent_report_worker_success_total Counter strategy_id Successful report generations
tradingagent_report_worker_error_total Counter strategy_id Failed report generations
tradingagent_report_staleness_seconds Histogram strategy_id Report age at query time

Changes

internal/llm

  • RetryMetrics interface + WithRetryMetrics() on RetryProvider — emits on each backoff attempt
  • BudgetMetrics interface + WithBudgetMetrics() on BudgetGuardProvider — emits before rejecting exhausted calls
  • WithChainRetryMetrics / WithChainBudgetMetrics chain options in provider_chain.go

internal/metrics

  • 5 new instruments registered on the private registry
  • New recording methods: RecordLLMRetry, RecordLLMBudgetExhausted, RecordReportWorkerSuccess, RecordReportWorkerError, ObserveReportStaleness
  • *Metrics satisfies llm.BudgetMetrics, api.ReportMetrics, and automation.ReportWorkerMetrics interfaces

cmd/tradingagent/runtime

  • retryMetricsAdapter bridges provider label to the no-arg RetryMetrics interface
  • chainOpts wires retry + budget metrics into the chain
  • WithReportMetrics(appMetrics) called alongside WithJobMetrics

internal/automation

  • ReportWorkerMetrics interface on JobOrchestrator + WithReportMetrics() setter
  • Emit success/error per strategy in the paperValidationReport loop
  • Auto-disable: jobs with ≥5 consecutive failures have Enabled set to false and log at ERROR. Applied in both wrapAndRun and runDirect. Re-enable via PUT /api/v1/automation/jobs/{name}/enable.

internal/api

  • ReportMetrics interface in report_handlers.go
  • ObserveReportStaleness emitted on GET /api/v1/strategies/{id}/reports/latest
  • Deps.ReportMetrics wired through to Server

docs/reference/llm-resilience.md (new)

  • Chain architecture diagram
  • Complete env var reference table + emergency override table
  • Report worker ops (schedule, jitter, auto-disable, re-enable)
  • Prometheus metrics reference
  • Troubleshooting playbooks: all-providers-down, budget-exhausted, report-stale, fallback-storm
  • 7-day validation success criteria

Verification

go build ./cmd/tradingagent          ✓
go test -short -count=1 ./internal/llm/... ./internal/metrics/... ./internal/automation/... ./cmd/tradingagent/...  ✓
golangci-lint (changed files)        ✓ (remaining lint issues pre-existing)

Add Prometheus instruments and wiring for the full LLM resilience chain
and report worker pipeline (PR 6 of llm-resilience-refactor-plan).

Metrics added:
- tradingagent_llm_retry_total (counter, by provider)
- tradingagent_llm_budget_exhausted_total (counter)
- tradingagent_report_worker_success_total (counter, by strategy_id)
- tradingagent_report_worker_error_total (counter, by strategy_id)
- tradingagent_report_staleness_seconds (histogram, by strategy_id)

LLM package:
- RetryMetrics interface + WithRetryMetrics() on RetryProvider
- BudgetMetrics interface + WithBudgetMetrics() on BudgetGuardProvider
- WithChainRetryMetrics/WithChainBudgetMetrics chain options
- Emit retry metric on each backoff attempt
- Emit budget exhaustion metric before rejecting call

Runtime:
- retryMetricsAdapter bridges provider label to RetryMetrics interface
- chainOpts wires retry + budget metrics; report metrics wired on orch

Automation:
- ReportWorkerMetrics interface on JobOrchestrator
- WithReportMetrics() setter (mirrors existing WithJobMetrics pattern)
- Emit success/error per strategy in paperValidationReport loop
- Auto-disable: jobs with >=5 consecutive failures set Enabled=false
  and log at ERROR level in both wrapAndRun and runDirect paths

API:
- ReportMetrics interface defined in report_handlers.go
- ObserveReportStaleness emitted on GET /reports/latest
- Deps.ReportMetrics wired through to Server

Docs:
- docs/reference/llm-resilience.md: chain diagram, env var table,
  emergency overrides, report worker ops, Prometheus metrics reference,
  troubleshooting playbooks (all providers down / budget exhausted /
  report stale / fallback storm), 7-day success criteria
Copilot AI review requested due to automatic review settings April 17, 2026 04:22
@github-actions github-actions Bot added the size:m Medium effort label Apr 17, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (6/6 of the LLM resilience refactor plan) wires Prometheus observability into the LLM provider chain and report worker pipeline, adds job auto-disable behavior after repeated failures, and ships an operational reference document.

Changes:

  • Adds new Prometheus instruments + recording helpers for LLM retry/budget exhaustion, report worker success/error, and report staleness.
  • Extends LLM chain layers (retry + budget guard) to emit metrics via new chain options and provider hooks.
  • Adds report-worker metrics emission and auto-disable logic (≥5 consecutive failures) in the automation orchestrator; emits report staleness from the “latest report” API endpoint; wires metrics through runtime.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
internal/metrics/metrics.go Registers new Prometheus instruments and exposes convenience record/observe methods.
internal/llm/retry.go Adds optional retry metrics hook and emits on each backoff attempt.
internal/llm/provider_chain.go Adds chain options to wire retry/budget metrics into their respective layers.
internal/llm/budget.go Adds optional budget exhaustion metrics hook before rejecting calls.
internal/automation/orchestrator.go Adds report-worker metrics plumbing and auto-disable logic after consecutive failures.
internal/automation/jobs_reports.go Emits per-strategy report worker success/error counters.
internal/api/server.go Wires ReportMetrics dependency into the API server.
internal/api/report_handlers.go Observes report staleness on GET .../reports/latest.
cmd/tradingagent/runtime.go Wires metrics implementations into the LLM chain and report worker orchestrator.
docs/reference/llm-resilience.md Adds operational reference documentation for chain behavior, metrics, and playbooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/llm/budget.go
Comment on lines 146 to 153
// Complete checks budget before delegating. Records usage after success.
func (b *BudgetGuardProvider) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error) {
if !b.budget.reserveRequest() {
if b.metrics != nil {
b.metrics.RecordLLMBudgetExhausted()
}
return nil, ErrBudgetExhausted
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Budget exhaustion metrics emission was added (RecordLLMBudgetExhausted() before returning ErrBudgetExhausted), but budget_test.go doesn’t cover this new behavior. Adding a test with a stub BudgetMetrics to assert it records exactly when reserveRequest() fails would help prevent silent regressions in observability.

Copilot uses AI. Check for mistakes.
Comment on lines 150 to +179
// Layer 3: retry
if cfg.maxAttempts > 1 {
retryOpts := []RetryOption{WithMaxAttempts(cfg.maxAttempts)}
if cfg.baseDelay > 0 {
retryOpts = append(retryOpts, WithBaseDelay(cfg.baseDelay))
}
p = NewRetryProvider(p, logger, retryOpts...)
rp := NewRetryProvider(p, logger, retryOpts...)
if cfg.metrics.retry != nil {
rp = rp.WithRetryMetrics(cfg.metrics.retry)
}
p = rp
}

// Layer 4: throttle
if cfg.throttle > 0 {
p = NewThrottledProvider(p, cfg.throttle)
}

// Layer 5: per-call timeout
if cfg.callTimeout > 0 {
p = &timeoutProvider{inner: p, timeout: cfg.callTimeout}
}

// Layer 6 (outermost): budget guard
if cfg.budget != nil {
p = NewBudgetGuardProvider(p, cfg.budget)
bg := NewBudgetGuardProvider(p, cfg.budget)
if cfg.metrics.budget != nil {
bg = bg.WithBudgetMetrics(cfg.metrics.budget)
}
p = bg
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New chain options (WithChainRetryMetrics, WithChainBudgetMetrics) are added and wired, but provider_chain_test.go doesn’t validate that these metrics are actually attached/invoked when the retry/budget layers are enabled. Consider adding tests that build a chain with these options and assert the corresponding stub metrics are called.

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +60
if s.reportMetrics != nil {
s.reportMetrics.ObserveReportStaleness(id.String(), stale)
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleGetLatestReport now emits ObserveReportStaleness, but report_handlers_test.go doesn’t assert that the metrics hook is called (or that the staleness value passed matches the response). Adding a test with a stub ReportMetrics implementation would help ensure this observation doesn’t get accidentally removed or miscomputed.

Copilot uses AI. Check for mistakes.
Comment thread cmd/tradingagent/runtime.go Outdated
Comment on lines +543 to +546
if cfg.RetryMaxAttempts > 1 {
opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))
if appMetrics != nil {
opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)}))
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithChainRetryMetrics is wired using a retryMetricsAdapter that binds the label to cfg.DefaultProvider, but the retry layer wraps the fallback layer (retry → fallback). This means retries triggered after the chain has fallen back to the secondary provider will still be recorded under the primary provider label, which makes tradingagent_llm_retry_total{provider=...} misleading. Consider either (a) changing the metric/label to reflect “configured primary” rather than actual provider, or (b) restructuring the chain so retry is applied per-provider (e.g., wrap primary+secondary with retry before composing fallback) so the label matches reality.

Suggested change
if cfg.RetryMaxAttempts > 1 {
opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))
if appMetrics != nil {
opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)}))
//
// The retry metric label is intentionally bound to the configured primary
// provider, not the actual provider that may ultimately handle the call after
// fallback. This keeps the label semantics explicit and avoids attributing
// secondary-provider retries to the wrong runtime provider identity.
if cfg.RetryMaxAttempts > 1 {
opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))
if appMetrics != nil {
opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{
m: appMetrics,
provider: fmt.Sprintf("configured_primary:%s", strings.TrimSpace(cfg.DefaultProvider)),
}))

Copilot uses AI. Check for mistakes.
Comment thread internal/automation/orchestrator.go Outdated
Comment on lines +111 to +115
// ReportWorkerMetrics captures report worker success/error/staleness.
type ReportWorkerMetrics interface {
RecordReportWorkerSuccess(strategyID string)
RecordReportWorkerError(strategyID string)
ObserveReportStaleness(strategyID string, seconds float64)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReportWorkerMetrics includes ObserveReportStaleness, but the automation package only uses reportMetrics for report-worker success/error emission (staleness is observed in the API handler). Keeping an unused method in this interface forces any alternative implementation to add an unrelated method. Consider splitting this into a narrower report-worker interface (success/error only) and keeping staleness observation on the API-side ReportMetrics interface.

Suggested change
// ReportWorkerMetrics captures report worker success/error/staleness.
type ReportWorkerMetrics interface {
RecordReportWorkerSuccess(strategyID string)
RecordReportWorkerError(strategyID string)
ObserveReportStaleness(strategyID string, seconds float64)
// ReportWorkerMetrics captures report worker success/error emission.
type ReportWorkerMetrics interface {
RecordReportWorkerSuccess(strategyID string)
RecordReportWorkerError(strategyID string)

Copilot uses AI. Check for mistakes.
Comment on lines +325 to +331
if job.ConsecutiveFailures >= autoDisableThreshold {
job.Enabled = false
o.logger.Error("automation: auto-disabled job after consecutive failures",
slog.String("job", job.Name),
slog.Int("consecutive_failures", job.ConsecutiveFailures),
)
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-disable behavior is newly introduced here (disable job after autoDisableThreshold consecutive failures), but there’s no unit test asserting that Enabled flips to false once the threshold is reached (and stays disabled for scheduled runs). Adding a test in orchestrator_test.go for both the manual (runDirect) and scheduled (wrapAndRun) paths would prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines +273 to +311
// RecordLLMRetry increments the retry counter for a given provider.
func (m *Metrics) RecordLLMRetry(provider string) {
if m == nil {
return
}
m.LLMRetryTotal.WithLabelValues(provider).Inc()
}

// RecordLLMBudgetExhausted increments the budget exhaustion counter.
func (m *Metrics) RecordLLMBudgetExhausted() {
if m == nil {
return
}
m.LLMBudgetExhaustedTotal.Inc()
}

// RecordReportWorkerSuccess increments the report success counter for a strategy.
func (m *Metrics) RecordReportWorkerSuccess(strategyID string) {
if m == nil {
return
}
m.ReportWorkerSuccessTotal.WithLabelValues(strategyID).Inc()
}

// RecordReportWorkerError increments the report error counter for a strategy.
func (m *Metrics) RecordReportWorkerError(strategyID string) {
if m == nil {
return
}
m.ReportWorkerErrorTotal.WithLabelValues(strategyID).Inc()
}

// ObserveReportStaleness records how stale a report is at query time.
func (m *Metrics) ObserveReportStaleness(strategyID string, seconds float64) {
if m == nil {
return
}
m.ReportStaleness.WithLabelValues(strategyID).Observe(seconds)
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New metric instruments and convenience methods were added (RecordLLMRetry, RecordLLMBudgetExhausted, RecordReportWorkerSuccess/Error, ObserveReportStaleness), but internal/metrics/metrics_test.go doesn’t exercise them (nor validate the emitted series). Adding coverage similar to TestNewCounters / TestConvenienceMethods would catch naming/registration/label regressions.

Copilot uses AI. Check for mistakes.
Comment thread internal/llm/retry.go
Comment on lines 123 to +128
if attempt > 0 {
delay := r.backoffDelay(attempt - 1)

if r.metrics != nil {
r.metrics.RecordLLMRetry()
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry metrics emission was added (r.metrics.RecordLLMRetry() on attempts > 0), but retry_test.go doesn’t currently assert that retries increment/record metrics. Consider adding a small stub RetryMetrics implementation in tests and verifying it’s called exactly once per backoff attempt when a retryable error occurs.

Copilot uses AI. Check for mistakes.
@PatrickFanella
Copy link
Copy Markdown
Owner Author

@copilot apply changes based on the comments in this thread

Agent-Logs-Url: https://github.com/PatrickFanella/augr/sessions/3c3eb647-082d-4ab2-8e2d-72f44dbd924d

Co-authored-by: PatrickFanella <61631520+PatrickFanella@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 17, 2026

@copilot apply changes based on the comments in this thread

Implemented the review-thread follow-ups in commit 7fca608: added the missing observability/unit coverage (budget, retry, provider-chain metrics wiring, report staleness handler metrics, and new metrics instruments), narrowed ReportWorkerMetrics to success/error only, added auto-disable threshold tests for both manual and scheduled paths, and updated retry metric labeling semantics to configured_primary:<provider> to avoid misleading attribution.

@PatrickFanella PatrickFanella merged commit 2e70de6 into main Apr 17, 2026
1 check passed
@PatrickFanella PatrickFanella deleted the feat/llm-resilience-observability branch April 17, 2026 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:m Medium effort

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants