feat(observability): wire LLM + report worker metrics, auto-disable jobs by PatrickFanella · Pull Request #628 · PatrickFanella/augr

PatrickFanella · 2026-04-17T04:22:22Z

Summary

PR 6 of 6 from docs/design/llm-resilience-refactor-plan.md.

Wires Prometheus observability into the LLM provider chain and report worker pipeline, adds auto-disable logic for misbehaving jobs, and adds an operational reference doc.

New Metrics

Metric	Type	Labels	Description
`tradingagent_llm_retry_total`	Counter	provider	Retry attempts by provider
`tradingagent_llm_budget_exhausted_total`	Counter	—	Budget exhaustion rejections
`tradingagent_report_worker_success_total`	Counter	strategy_id	Successful report generations
`tradingagent_report_worker_error_total`	Counter	strategy_id	Failed report generations
`tradingagent_report_staleness_seconds`	Histogram	strategy_id	Report age at query time

Changes

`internal/llm`

RetryMetrics interface + WithRetryMetrics() on RetryProvider — emits on each backoff attempt
BudgetMetrics interface + WithBudgetMetrics() on BudgetGuardProvider — emits before rejecting exhausted calls
WithChainRetryMetrics / WithChainBudgetMetrics chain options in provider_chain.go

`internal/metrics`

5 new instruments registered on the private registry
New recording methods: RecordLLMRetry, RecordLLMBudgetExhausted, RecordReportWorkerSuccess, RecordReportWorkerError, ObserveReportStaleness
*Metrics satisfies llm.BudgetMetrics, api.ReportMetrics, and automation.ReportWorkerMetrics interfaces

`cmd/tradingagent/runtime`

retryMetricsAdapter bridges provider label to the no-arg RetryMetrics interface
chainOpts wires retry + budget metrics into the chain
WithReportMetrics(appMetrics) called alongside WithJobMetrics

`internal/automation`

ReportWorkerMetrics interface on JobOrchestrator + WithReportMetrics() setter
Emit success/error per strategy in the paperValidationReport loop
Auto-disable: jobs with ≥5 consecutive failures have Enabled set to false and log at ERROR. Applied in both wrapAndRun and runDirect. Re-enable via PUT /api/v1/automation/jobs/{name}/enable.

`internal/api`

ReportMetrics interface in report_handlers.go
ObserveReportStaleness emitted on GET /api/v1/strategies/{id}/reports/latest
Deps.ReportMetrics wired through to Server

`docs/reference/llm-resilience.md` (new)

Chain architecture diagram
Complete env var reference table + emergency override table
Report worker ops (schedule, jitter, auto-disable, re-enable)
Prometheus metrics reference
Troubleshooting playbooks: all-providers-down, budget-exhausted, report-stale, fallback-storm
7-day validation success criteria

Verification

go build ./cmd/tradingagent          ✓
go test -short -count=1 ./internal/llm/... ./internal/metrics/... ./internal/automation/... ./cmd/tradingagent/...  ✓
golangci-lint (changed files)        ✓ (remaining lint issues pre-existing)

Add Prometheus instruments and wiring for the full LLM resilience chain and report worker pipeline (PR 6 of llm-resilience-refactor-plan). Metrics added: - tradingagent_llm_retry_total (counter, by provider) - tradingagent_llm_budget_exhausted_total (counter) - tradingagent_report_worker_success_total (counter, by strategy_id) - tradingagent_report_worker_error_total (counter, by strategy_id) - tradingagent_report_staleness_seconds (histogram, by strategy_id) LLM package: - RetryMetrics interface + WithRetryMetrics() on RetryProvider - BudgetMetrics interface + WithBudgetMetrics() on BudgetGuardProvider - WithChainRetryMetrics/WithChainBudgetMetrics chain options - Emit retry metric on each backoff attempt - Emit budget exhaustion metric before rejecting call Runtime: - retryMetricsAdapter bridges provider label to RetryMetrics interface - chainOpts wires retry + budget metrics; report metrics wired on orch Automation: - ReportWorkerMetrics interface on JobOrchestrator - WithReportMetrics() setter (mirrors existing WithJobMetrics pattern) - Emit success/error per strategy in paperValidationReport loop - Auto-disable: jobs with >=5 consecutive failures set Enabled=false and log at ERROR level in both wrapAndRun and runDirect paths API: - ReportMetrics interface defined in report_handlers.go - ObserveReportStaleness emitted on GET /reports/latest - Deps.ReportMetrics wired through to Server Docs: - docs/reference/llm-resilience.md: chain diagram, env var table, emergency overrides, report worker ops, Prometheus metrics reference, troubleshooting playbooks (all providers down / budget exhausted / report stale / fallback storm), 7-day success criteria

Copilot

Pull request overview

This PR (6/6 of the LLM resilience refactor plan) wires Prometheus observability into the LLM provider chain and report worker pipeline, adds job auto-disable behavior after repeated failures, and ships an operational reference document.

Changes:

Adds new Prometheus instruments + recording helpers for LLM retry/budget exhaustion, report worker success/error, and report staleness.
Extends LLM chain layers (retry + budget guard) to emit metrics via new chain options and provider hooks.
Adds report-worker metrics emission and auto-disable logic (≥5 consecutive failures) in the automation orchestrator; emits report staleness from the “latest report” API endpoint; wires metrics through runtime.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
internal/metrics/metrics.go	Registers new Prometheus instruments and exposes convenience record/observe methods.
internal/llm/retry.go	Adds optional retry metrics hook and emits on each backoff attempt.
internal/llm/provider_chain.go	Adds chain options to wire retry/budget metrics into their respective layers.
internal/llm/budget.go	Adds optional budget exhaustion metrics hook before rejecting calls.
internal/automation/orchestrator.go	Adds report-worker metrics plumbing and auto-disable logic after consecutive failures.
internal/automation/jobs_reports.go	Emits per-strategy report worker success/error counters.
internal/api/server.go	Wires `ReportMetrics` dependency into the API server.
internal/api/report_handlers.go	Observes report staleness on `GET .../reports/latest`.
cmd/tradingagent/runtime.go	Wires metrics implementations into the LLM chain and report worker orchestrator.
docs/reference/llm-resilience.md	Adds operational reference documentation for chain behavior, metrics, and playbooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-17T04:27:29Z

 // Complete checks budget before delegating. Records usage after success.
 func (b *BudgetGuardProvider) Complete(ctx context.Context, req CompletionRequest) (*CompletionResponse, error) {
 	if !b.budget.reserveRequest() {
+		if b.metrics != nil {
+			b.metrics.RecordLLMBudgetExhausted()
+		}
 		return nil, ErrBudgetExhausted
 	}


Budget exhaustion metrics emission was added (RecordLLMBudgetExhausted() before returning ErrBudgetExhausted), but budget_test.go doesn’t cover this new behavior. Adding a test with a stub BudgetMetrics to assert it records exactly when reserveRequest() fails would help prevent silent regressions in observability.

Copilot · 2026-04-17T04:27:29Z

 	// Layer 3: retry
 	if cfg.maxAttempts > 1 {
 		retryOpts := []RetryOption{WithMaxAttempts(cfg.maxAttempts)}
 		if cfg.baseDelay > 0 {
 			retryOpts = append(retryOpts, WithBaseDelay(cfg.baseDelay))
 		}
-		p = NewRetryProvider(p, logger, retryOpts...)
+		rp := NewRetryProvider(p, logger, retryOpts...)
+		if cfg.metrics.retry != nil {
+			rp = rp.WithRetryMetrics(cfg.metrics.retry)
+		}
+		p = rp
 	}

 	// Layer 4: throttle
 	if cfg.throttle > 0 {
 		p = NewThrottledProvider(p, cfg.throttle)
 	}

 	// Layer 5: per-call timeout
 	if cfg.callTimeout > 0 {
 		p = &timeoutProvider{inner: p, timeout: cfg.callTimeout}
 	}

 	// Layer 6 (outermost): budget guard
 	if cfg.budget != nil {
-		p = NewBudgetGuardProvider(p, cfg.budget)
+		bg := NewBudgetGuardProvider(p, cfg.budget)
+		if cfg.metrics.budget != nil {
+			bg = bg.WithBudgetMetrics(cfg.metrics.budget)
+		}
+		p = bg


New chain options (WithChainRetryMetrics, WithChainBudgetMetrics) are added and wired, but provider_chain_test.go doesn’t validate that these metrics are actually attached/invoked when the retry/budget layers are enabled. Consider adding tests that build a chain with these options and assert the corresponding stub metrics are called.

Copilot · 2026-04-17T04:27:30Z

+	if s.reportMetrics != nil {
+		s.reportMetrics.ObserveReportStaleness(id.String(), stale)
+	}


handleGetLatestReport now emits ObserveReportStaleness, but report_handlers_test.go doesn’t assert that the metrics hook is called (or that the staleness value passed matches the response). Adding a test with a stub ReportMetrics implementation would help ensure this observation doesn’t get accidentally removed or miscomputed.

Copilot · 2026-04-17T04:27:30Z

 	if cfg.RetryMaxAttempts > 1 {
 		opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))
+		if appMetrics != nil {
+			opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)}))


WithChainRetryMetrics is wired using a retryMetricsAdapter that binds the label to cfg.DefaultProvider, but the retry layer wraps the fallback layer (retry → fallback). This means retries triggered after the chain has fallen back to the secondary provider will still be recorded under the primary provider label, which makes tradingagent_llm_retry_total{provider=...} misleading. Consider either (a) changing the metric/label to reflect “configured primary” rather than actual provider, or (b) restructuring the chain so retry is applied per-provider (e.g., wrap primary+secondary with retry before composing fallback) so the label matches reality.

Suggested change

if cfg.RetryMaxAttempts > 1 {

opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))

if appMetrics != nil {

opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{m: appMetrics, provider: strings.TrimSpace(cfg.DefaultProvider)}))

//

// The retry metric label is intentionally bound to the configured primary

// provider, not the actual provider that may ultimately handle the call after

// fallback. This keeps the label semantics explicit and avoids attributing

// secondary-provider retries to the wrong runtime provider identity.

if cfg.RetryMaxAttempts > 1 {

opts = append(opts, llm.WithRetry(cfg.RetryMaxAttempts))

if appMetrics != nil {

opts = append(opts, llm.WithChainRetryMetrics(&retryMetricsAdapter{

m: appMetrics,

provider: fmt.Sprintf("configured_primary:%s", strings.TrimSpace(cfg.DefaultProvider)),

}))

Copilot · 2026-04-17T04:27:30Z

+// ReportWorkerMetrics captures report worker success/error/staleness.
+type ReportWorkerMetrics interface {
+	RecordReportWorkerSuccess(strategyID string)
+	RecordReportWorkerError(strategyID string)
+	ObserveReportStaleness(strategyID string, seconds float64)


ReportWorkerMetrics includes ObserveReportStaleness, but the automation package only uses reportMetrics for report-worker success/error emission (staleness is observed in the API handler). Keeping an unused method in this interface forces any alternative implementation to add an unrelated method. Consider splitting this into a narrower report-worker interface (success/error only) and keeping staleness observation on the API-side ReportMetrics interface.

Suggested change

// ReportWorkerMetrics captures report worker success/error/staleness.

type ReportWorkerMetrics interface {

RecordReportWorkerSuccess(strategyID string)

RecordReportWorkerError(strategyID string)

ObserveReportStaleness(strategyID string, seconds float64)

// ReportWorkerMetrics captures report worker success/error emission.

type ReportWorkerMetrics interface {

RecordReportWorkerSuccess(strategyID string)

RecordReportWorkerError(strategyID string)

Copilot · 2026-04-17T04:27:31Z

+		if job.ConsecutiveFailures >= autoDisableThreshold {
+			job.Enabled = false
+			o.logger.Error("automation: auto-disabled job after consecutive failures",
+				slog.String("job", job.Name),
+				slog.Int("consecutive_failures", job.ConsecutiveFailures),
+			)
+		}


Auto-disable behavior is newly introduced here (disable job after autoDisableThreshold consecutive failures), but there’s no unit test asserting that Enabled flips to false once the threshold is reached (and stays disabled for scheduled runs). Adding a test in orchestrator_test.go for both the manual (runDirect) and scheduled (wrapAndRun) paths would prevent regressions.

Copilot · 2026-04-17T04:27:31Z

+// RecordLLMRetry increments the retry counter for a given provider.
+func (m *Metrics) RecordLLMRetry(provider string) {
+	if m == nil {
+		return
+	}
+	m.LLMRetryTotal.WithLabelValues(provider).Inc()
+}
+
+// RecordLLMBudgetExhausted increments the budget exhaustion counter.
+func (m *Metrics) RecordLLMBudgetExhausted() {
+	if m == nil {
+		return
+	}
+	m.LLMBudgetExhaustedTotal.Inc()
+}
+
+// RecordReportWorkerSuccess increments the report success counter for a strategy.
+func (m *Metrics) RecordReportWorkerSuccess(strategyID string) {
+	if m == nil {
+		return
+	}
+	m.ReportWorkerSuccessTotal.WithLabelValues(strategyID).Inc()
+}
+
+// RecordReportWorkerError increments the report error counter for a strategy.
+func (m *Metrics) RecordReportWorkerError(strategyID string) {
+	if m == nil {
+		return
+	}
+	m.ReportWorkerErrorTotal.WithLabelValues(strategyID).Inc()
+}
+
+// ObserveReportStaleness records how stale a report is at query time.
+func (m *Metrics) ObserveReportStaleness(strategyID string, seconds float64) {
+	if m == nil {
+		return
+	}
+	m.ReportStaleness.WithLabelValues(strategyID).Observe(seconds)
+}


New metric instruments and convenience methods were added (RecordLLMRetry, RecordLLMBudgetExhausted, RecordReportWorkerSuccess/Error, ObserveReportStaleness), but internal/metrics/metrics_test.go doesn’t exercise them (nor validate the emitted series). Adding coverage similar to TestNewCounters / TestConvenienceMethods would catch naming/registration/label regressions.

Copilot · 2026-04-17T04:27:31Z

 		if attempt > 0 {
 			delay := r.backoffDelay(attempt - 1)

+			if r.metrics != nil {
+				r.metrics.RecordLLMRetry()
+			}


Retry metrics emission was added (r.metrics.RecordLLMRetry() on attempts > 0), but retry_test.go doesn’t currently assert that retries increment/record metrics. Consider adding a small stub RetryMetrics implementation in tests and verifying it’s called exactly once per backoff attempt when a retryable error occurs.

PatrickFanella · 2026-04-17T04:43:42Z

@copilot apply changes based on the comments in this thread

Agent-Logs-Url: https://github.com/PatrickFanella/augr/sessions/3c3eb647-082d-4ab2-8e2d-72f44dbd924d Co-authored-by: PatrickFanella <61631520+PatrickFanella@users.noreply.github.com>

Copilot · 2026-04-17T04:52:18Z

@copilot apply changes based on the comments in this thread

Implemented the review-thread follow-ups in commit 7fca608: added the missing observability/unit coverage (budget, retry, provider-chain metrics wiring, report staleness handler metrics, and new metrics instruments), narrowed ReportWorkerMetrics to success/error only, added auto-disable threshold tests for both manual and scheduled paths, and updated retry metric labeling semantics to configured_primary:<provider> to avoid misleading attribution.

Copilot AI review requested due to automatic review settings April 17, 2026 04:22

github-actions Bot added the size:m Medium effort label Apr 17, 2026

Copilot started reviewing on behalf of PatrickFanella April 17, 2026 04:22 View session

Copilot AI reviewed Apr 17, 2026

View reviewed changes

Copilot started work on behalf of PatrickFanella April 17, 2026 04:43 View session

test: address PR review feedback for observability wiring

7fca608

Agent-Logs-Url: https://github.com/PatrickFanella/augr/sessions/3c3eb647-082d-4ab2-8e2d-72f44dbd924d Co-authored-by: PatrickFanella <61631520+PatrickFanella@users.noreply.github.com>

Copilot finished work on behalf of PatrickFanella April 17, 2026 04:53

PatrickFanella merged commit 2e70de6 into main Apr 17, 2026
1 check passed

PatrickFanella deleted the feat/llm-resilience-observability branch April 17, 2026 05:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): wire LLM + report worker metrics, auto-disable jobs#628

feat(observability): wire LLM + report worker metrics, auto-disable jobs#628
PatrickFanella merged 2 commits into
mainfrom
feat/llm-resilience-observability

PatrickFanella commented Apr 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

PatrickFanella commented Apr 17, 2026

Uh oh!

Copilot AI commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

PatrickFanella commented Apr 17, 2026

Summary

New Metrics

Changes

internal/llm

internal/metrics

cmd/tradingagent/runtime

internal/automation

internal/api

docs/reference/llm-resilience.md (new)

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

PatrickFanella commented Apr 17, 2026

Uh oh!

Copilot AI commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`internal/llm`

`internal/metrics`

`cmd/tradingagent/runtime`

`internal/automation`

`internal/api`

`docs/reference/llm-resilience.md` (new)