feat: add Langfuse metrics API for per-agent token/cost stats and trace tags#47
feat: add Langfuse metrics API for per-agent token/cost stats and trace tags#47
Conversation
There was a problem hiding this comment.
Pull request overview
Adds Langfuse “metrics v1” querying to retrieve per-agent token/cost usage stats, and improves trace/span structure and tagging to make Langfuse observability more useful (persona filtering + proper span nesting for vectorstore ops).
Changes:
- Add
Client.GetAgentStats(+ global wrapper) and implement Langfuse Metrics v1/api/public/metricsquery/parse logic. - Add Ginkgo coverage for metrics querying/parsing and global behavior.
- Improve Langfuse trace tagging (persona/agent) and add parent spans for vectorstore add/search to avoid orphan spans.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/memory/vector/store.go | Adds parent spans for vectorstore.add / vectorstore.search and operation attributes. |
| pkg/langfuse/metrics.go | Implements Langfuse Metrics v1 query + parsing helpers and new request/response types. |
| pkg/langfuse/metrics_test.go | Adds test coverage for GetAgentStats, parsing, error cases, and global wrapper behavior. |
| pkg/langfuse/client.go | Extends Client interface and adds package-level GetAgentStats + noop implementation. |
| pkg/langfuse/langfusefakes/fake_client.go | Updates generated fake to include GetAgentStats. |
| pkg/expert/modelprovider/model.go | Removes debug logging for token tailoring enabled. |
| pkg/app/app.go | Adds persona tag to messenger traces and introduces AG-UI root span with Langfuse trace attributes/tags. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0853770 to
0e69215
Compare
…ce tags - Add GetAgentStats to langfuse.Client interface for querying per-agent token usage and cost statistics via the Langfuse Metrics v1 API - New metrics.go with AgentUsageStats, GetAgentStatsRequest types and implementation using observations view grouped by traceName - Handle v1 API quirks: string-encoded token counts, nullable costs, aggregation_measure key format - Comprehensive test coverage (12 new specs) including null costs, string parsing, filtering, error handling - Regenerate counterfeiter fakes for updated Client interface - Add persona/agent name to langfuse.trace.tags in both AG-UI and messenger handler paths for dashboard filtering - Add vectorstore.add and vectorstore.search parent spans for cleaner Langfuse trace nesting
0e69215 to
b0d77ea
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Use OTel Baggage to propagate langfuse.trace.tags to all spans in the trace, including trpc-agent-go internal spans. The Langfuse exporter's baggageBatchSpanProcessor copies baggage members onto every span at start time, ensuring tags appear even when the library's own spans become the trace root. Also keep the explicit StringSlice attribute on our handler spans as the primary/correct format (string[] vs comma-separated string).
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…g config - Fix NewClient() not setting config field on client struct, causing GetAgentStats to generate URLs with no host (https:///api/...) - Use actual agent name from context as Langfuse trace root span name in orchestrator.Chat() instead of hardcoded 'codeowner.chat' - Update adaptive loop span in tree_v2 to use agent-prefixed name (e.g. 'devops-copilot.adaptive_loop') instead of 'reactree.adaptive_loop' - Add trace-cost-calculator example that queries Langfuse API for per-agent usage stats
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Remove langfuse.trace.tags from vectorstore.add and vectorstore.search spans — these trace-level attributes caused Langfuse to promote child spans into independent root traces - Remove langfuse.trace.input and langfuse.trace.tags from the adaptive loop span in tree_v2 — same promotion issue - Use regular span attributes (vectorstore.agent, reactree.agent) instead for observability without trace splitting - Restore agent name prefix on adaptive loop span name
Halguard's generateText() calls model.GenerateContent() directly,
bypassing the trpc-agent-go runner which normally instruments LLM calls.
This meant halguard pre-check and post-check model calls were invisible
in Langfuse traces.
- Add OTel span to generateText following 'chat {model}' naming convention
- Record gen_ai semantic attributes (request model, response model, tokens)
- Spans appear as children of halguard.PreCheck/PostCheck spans
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix metricsQuery doc comment: says v1 (not v2) endpoint - Validate req.Duration > 0 in GetAgentStats to prevent zero/negative time window queries - Fix GetAgentStats package-level docstring: says nil,nil (not empty slice) when no client is configured - Add pii.Redact to messenger trace input for consistent PII handling (AG-UI path already redacted) - Merge baggage instead of replacing: withLangfuseTraceBaggage now preserves existing upstream baggage values
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two-layer fix for sub-agents that output commands as text instead of executing them via run_shell, or refuse with 'I don't know' when tools are available. Layer 1 - Stronger system prompt (subagent_instruction.go): - Added TOOL USE IS MANDATORY directive - Prohibit 'I don't know' / 'I don't have access' / 'I cannot' refusals - Require executing embedded scripts via run_shell, not echoing as markdown - Explicit instruction that tools exist to gather real data Layer 2 - Runtime guard (create_agent.go): - Zero-tool-use guard detects when toolCallCount==0 with tools available - Annotates output with warning and sets status='tool_use_failure' - Gives parent agent clear signal to re-spawn the sub-agent Tests: - buildSubAgentInstruction: tool-use enforcement, allowlist, core directives - Zero-tool-use guard: table-driven condition tests, output annotation format
Root spans in AGUI and messenger paths used otel.Tracer() which resolves to the global OTel noop provider. The Langfuse exporter registers its own TracerProvider on trace.Tracer but never calls otel.SetTracerProvider(), so the root spans were never exported. All downstream spans (orchestrator, tree, vectorstore) became orphaned root-level Langfuse traces instead of children. Switch both paths to trace.Tracer and inline SetAttributes into Start() via oteltrace.WithAttributes for brevity.
Address PR review comments: - Set langfuse.trace.name to a.displayName() (agent name) instead of 'agui chat' / platform label so GetAgentStats can filter by agent. Channel/platform info is already in tags. - Log warnings in withLangfuseTraceBaggage when baggage creation fails instead of silently returning the original context.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ontext - halguard.New: panic on nil textGenerator to fail fast instead of allowing a runtime nil-pointer dereference during PostCheck. - create_agent: use toolNameList (from selectedTools) instead of scopedRegistry.ToolNames() in the zero-tool-use guard message so the reported tools match what the sub-agent actually had. - app.go: set orchestratorcontext.WithAgent in the messenger path so downstream spans use the correct agent name instead of DefaultAgentName.
When a browser tab is present, bridgeBrowserTab creates a new context derived from the chromedp tab context which has no OTel span context. This caused child spans (adaptive_loop, invoke_agent, vectorstore.add) to get new traceIDs and appear as separate top-level traces in Langfuse instead of being nested under the parent trace. Fix: inject the parent's OTel span and baggage into the bridged context so that all downstream spans maintain the correct trace hierarchy and inherit Langfuse attributes (userId, sessionId, tags, metadata). Add tests verifying span context, baggage, per-request values, and deadline propagation through bridgeBrowserTab.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // singleModelProvider adapts a single model.Model into the | ||
| // modelprovider.ModelProvider interface so that expert.ExpertBio.ToExpert can | ||
| // use a specific, pre-selected model rather than task-type-based routing. | ||
| // This is necessary because halguard selects diverse models for cross-model | ||
| // verification and needs to call each one individually. | ||
| type singleModelProvider struct { | ||
| key string | ||
| model model.Model | ||
| } | ||
|
|
||
| // GetModel always returns the single wrapped model regardless of task type. | ||
| func (p *singleModelProvider) GetModel(_ context.Context, _ modelprovider.TaskType) (modelprovider.ModelMap, error) { | ||
| return modelprovider.ModelMap{p.key: p.model}, nil | ||
| } |
There was a problem hiding this comment.
This file defines a local singleModelProvider, but the PR also adds expert/modelprovider.NewSingleModelProvider(...) with the same purpose. Keeping both duplicates logic and invites divergence (e.g., interface changes). Prefer using the shared helper here (and drop the local type), or remove the new helper file if it’s not intended for reuse.
pkg/reactree/subagent_instruction.go
Outdated
| instruction += "\nTOOL USE IS MANDATORY: You have tools — you MUST use them. " + | ||
| "NEVER say 'I don't know', 'I don't have access', or 'I cannot' when you have tools that can gather the information. " + | ||
| "If your goal describes a task and you have tools like run_shell, use them to execute the task. " + | ||
| "NEVER output commands, scripts, or code as text in your response — always execute them via the appropriate tool. " + | ||
| "If your goal contains a shell script or command to run, call run_shell to EXECUTE it — do NOT echo or display the script as markdown. " + | ||
| "You have tools for a reason: USE THEM to gather real data, then summarize the results. " + | ||
| "\nHITL REJECTION: If a tool call is rejected by the user with feedback suggesting a different tool or approach, " + |
There was a problem hiding this comment.
buildSubAgentInstruction unconditionally instructs the sub-agent to "call run_shell" when the goal contains a shell command/script. When toolNames does not include run_shell (or when toolNames is nil/empty), this directive conflicts with the allowlist and can cause the sub-agent to attempt an unavailable tool. Suggest gating the run_shell-specific text on toolNames containing run_shell, or rewriting it to reference "an available shell tool" / "if run_shell is available".
| instruction += "\nTOOL USE IS MANDATORY: You have tools — you MUST use them. " + | |
| "NEVER say 'I don't know', 'I don't have access', or 'I cannot' when you have tools that can gather the information. " + | |
| "If your goal describes a task and you have tools like run_shell, use them to execute the task. " + | |
| "NEVER output commands, scripts, or code as text in your response — always execute them via the appropriate tool. " + | |
| "If your goal contains a shell script or command to run, call run_shell to EXECUTE it — do NOT echo or display the script as markdown. " + | |
| "You have tools for a reason: USE THEM to gather real data, then summarize the results. " + | |
| "\nHITL REJECTION: If a tool call is rejected by the user with feedback suggesting a different tool or approach, " + | |
| // Detect whether run_shell is actually available so we don't instruct | |
| // the sub-agent to call a tool that is not in its allowlist. | |
| hasRunShell := false | |
| for _, name := range toolNames { | |
| if name == "run_shell" { | |
| hasRunShell = true | |
| break | |
| } | |
| } | |
| if hasRunShell { | |
| instruction += "\nTOOL USE IS MANDATORY: You have tools — you MUST use them. " + | |
| "NEVER say 'I don't know', 'I don't have access', or 'I cannot' when you have tools that can gather the information. " + | |
| "If your goal describes a task and you have tools like run_shell, use them to execute the task. " + | |
| "NEVER output commands, scripts, or code as text in your response — always execute them via the appropriate tool. " + | |
| "If your goal contains a shell script or command to run, call run_shell to EXECUTE it — do NOT echo or display the script as markdown. " + | |
| "You have tools for a reason: USE THEM to gather real data, then summarize the results. " | |
| } else { | |
| instruction += "\nTOOL USE IS MANDATORY: You have tools — you MUST use them. " + | |
| "NEVER say 'I don't know', 'I don't have access', or 'I cannot' when you have tools that can gather the information. " + | |
| "If your goal describes a task and you have tools suitable for executing commands or scripts, use them to execute the task. " + | |
| "NEVER output commands, scripts, or code as text in your response — always execute them via the appropriate available tool. " + | |
| "If your goal contains a shell script or command to run, call an appropriate available shell-execution tool to EXECUTE it — do NOT echo or display the script as markdown. " + | |
| "You have tools for a reason: USE THEM to gather real data, then summarize the results. " | |
| } | |
| instruction += "\nHITL REJECTION: If a tool call is rejected by the user with feedback suggesting a different tool or approach, " + |
| shouldFire := func(gi guardInput) bool { | ||
| return gi.toolCallCount == 0 && gi.result != "" && gi.status != "error" && !gi.timedOut && gi.numSelectedTools > 0 | ||
| } | ||
|
|
||
| DescribeTable("fires or skips based on conditions", | ||
| func(gi guardInput, expectFire bool) { | ||
| Expect(shouldFire(gi)).To(Equal(expectFire)) | ||
| }, |
There was a problem hiding this comment.
The added "zero-tool-use guard" specs mostly re-implement the guard predicate and string formatting rather than exercising the actual createAgentTool.executeInner behavior. This won’t catch regressions like the guard not firing due to event parsing changes, toolNameList mismatches, or status being overwritten later. Consider adding an integration-style unit test that runs executeInner with a stubbed runner/event stream to assert Status becomes tool_use_failure and Output is annotated only when toolCallCount == 0 and tools are available.
…_failure
Sub-agents (especially Gemini models) were echoing shell scripts as text
instead of calling run_shell. Two fixes:
1. Restructure subagent instruction: move MANDATORY tool-use rule to the
very first sentence (LLMs attend most to prompt start). Add dedicated
SCRIPT EXECUTION section targeting the exact failure mode of embedded
bash scripts being echoed as markdown.
2. Auto-retry on tool_use_failure: when zero-tool-use guard fires, retry
once with a reinforced prompt ('[RETRY] You MUST call run_shell...')
before returning failure to the parent agent. Uses -retry agent name
suffix for tracing.
Tests: fix 2 existing tests for new wording, add 5 new specs for SCRIPT
EXECUTION directive and auto-retry logic (149 specs pass).
Summary
Adds the ability to query Langfuse for per-agent token usage and cost statistics, and improves trace tagging for better observability.
Langfuse Metrics API
GetAgentStatsmethod onlangfuse.Clientinterface/api/public/metrics) using theobservationsview grouped bytraceNameAgentUsageStatsper agent:TotalCost,InputTokens,OutputTokens,TotalTokens,Countaggregation_measurekey formatTrace Tags
langfuse.trace.tagsin both AG-UI and messenger handler pathslangfuse.trace.name,langfuse.trace.input, and tagsVector Store Spans
vectorstore.addandvectorstore.searchoperationsTests
parseAgentStats,numericFromMap, and global function behaviorUsage Example