## Objective

Add an `agentv import` command that reads existing AI coding sessions from Claude Code, Codex CLI, and Copilot CLI, normalizes them into a tool-agnostic transcript format, and feeds them into the existing evaluator pipeline for grading. This enables comparing different clients and workspace setups from manually-run sessions without re-executing anything.
**Bundled:** rename `agentv trace` → `agentv inspect` to free up "trace" for its industry-standard meaning (OTel spans) and avoid terminology collision.
## Motivation

- Teams run the same task in Claude Code, Codex CLI, and Copilot CLI to compare quality — today there's no way to grade those sessions without re-running them
- Workspace setup experiments (different CLAUDE.md files, system prompts, skills) produce sessions that should be gradeable after the fact
- Grader iteration: import once, re-grade many times with different evaluators without burning API tokens
- Industry alignment: every major eval framework (DeepEval, Braintrust, LangSmith) separates data import from evaluation
## Design

### Architecture: Two-step pipeline

Industry best practice separates import from evaluation. The data carries the output, not the provider.

When `--transcript` is provided, the orchestrator skips provider invocation and uses the pre-populated `output` from the transcript as the `ProviderResponse`. Evaluators run identically to live eval.
### Bundled rename: `agentv trace` → `agentv inspect`

Current `agentv trace` subcommands are all result inspection tools, not OTel trace operations:

| Current | Renamed | Purpose |
| --- | --- | --- |
| `agentv trace list` | `agentv inspect list` | List result files |
| `agentv trace show` | `agentv inspect show` | Show result details with execution tree |
| `agentv trace stats` | `agentv inspect stats` | Compute percentile stats |
| `agentv trace score` | `agentv inspect score` | Re-run evaluators post-hoc |
Keep `agentv trace` as a deprecated alias for one release cycle.
Additional tools that parse these session formats:

- Agent Trace — emerging standard for AI code attribution (Cursor, Vercel, Google Jules)
- entireio/cli — git-integrated session capture with checkpoint/rewind
## Design clarifications
These resolve ambiguities an implementing agent would otherwise need to ask about.
### How transcripts map to eval test cases

A transcript is not matched to eval tests by input string. Instead, `--transcript` provides a pre-populated `ProviderResponse` that replaces provider invocation entirely:

- The eval YAML still defines tests with `input` and `assert` — these provide the grading criteria
- The transcript provides the `output` (`Message[]`) that would normally come from a live provider
- Matching is positional: transcript line 1 → test 1, transcript line 2 → test 2
- If the eval has 1 test and the transcript has 1 line, they pair 1:1 (most common case)
- If counts don't match, error with a clear message
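The positional rule above can be sketched as follows (`pairByPosition` and its error message are illustrative, not the actual implementation):

```typescript
// Pair transcript lines with eval tests strictly by position,
// erroring when the counts differ (per the rule above).
function pairByPosition<A, B>(tests: A[], lines: B[]): Array<[A, B]> {
  if (tests.length !== lines.length) {
    throw new Error(
      `Eval defines ${tests.length} test(s) but transcript has ${lines.length} line(s); counts must match.`,
    );
  }
  return tests.map((test, i) => [test, lines[i]]);
}
```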
This means a typical workflow is:

1. Run a task manually in Claude Code
2. Import the session → produces a 1-line transcript (the whole session is one entry)
3. Write an eval YAML with 1 test that has the same input + the assertions you want
4. Run `agentv eval <eval.yaml> --transcript <transcript.jsonl>`
### Multi-turn sessions → single transcript entry

A session with multiple user messages is one transcript line (one test case). The entire conversation is captured in the `output: Message[]` array. This matches how agent evals work: one task = one session, even if it spans many turns.

The `input` field in the transcript captures the first user message (the initial task). All subsequent messages (follow-ups, clarifications) are part of the `output` conversation.
### Orchestrator `--transcript` bypass mechanics

When `--transcript` is provided:

- `targets:` in the eval YAML is ignored (no provider is invoked)
- The `target` field in result JSONL is set to `${source.provider}` from the transcript (e.g., `claude-cli`)
- `trials.count` is forced to 1 (replaying the same transcript multiple times is meaningless)
- `workspace_template` is ignored (no workspace is created)
- The orchestrator constructs a `ProviderResponse` from the transcript: `{ output: line.output, tokenUsage: line.token_usage, durationMs: line.duration_ms, costUsd: line.cost_usd, startTime: line.source.timestamp }`
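A sketch of that construction, using simplified stand-ins for the `TranscriptLine` and `ProviderResponse` shapes in `packages/core` (the real types live in `providers/types.ts`):

```typescript
interface TranscriptLine {
  input: string;
  output: unknown[]; // Message[] in the real schema
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: { provider: string; timestamp?: string };
}

interface ProviderResponse {
  output: unknown[];
  tokenUsage?: { input: number; output: number; cached?: number };
  durationMs?: number;
  costUsd?: number | null;
  startTime?: string;
}

// Build the ProviderResponse the orchestrator would otherwise get
// from a live provider, using the pre-populated transcript line.
function responseFromTranscript(line: TranscriptLine): ProviderResponse {
  return {
    output: line.output,
    tokenUsage: line.token_usage,
    durationMs: line.duration_ms,
    costUsd: line.cost_usd,
    startTime: line.source.timestamp,
  };
}
```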
Multiple transcripts for comparison:
```sh
# Run the same eval three times with different transcripts
agentv eval evals/auth.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/copilot-auth.jsonl
# Each produces a separate result run, then compare
```
### File placement in monorepo

```
packages/core/src/import/        # NEW — parser logic (reusable by CLI and SDK)
  claude-parser.ts               # Parse Claude Code session JSONL → Message[]
  codex-parser.ts                # Parse Codex CLI rollout JSONL → Message[]
  copilot-parser.ts              # Wraps existing copilot-log-parser.ts
  session-discovery.ts           # Unified session discovery (find latest, by id, by project)
  types.ts                       # TranscriptEntry interface, ImportConfig
  index.ts                       # Public API

apps/cli/src/commands/import/    # NEW — CLI command
  index.ts                       # Subcommand registration (claude, codex, copilot)
  claude.ts                      # CLI handler for `agentv import claude`
  codex.ts                       # CLI handler for `agentv import codex`
  copilot.ts                     # CLI handler for `agentv import copilot`

apps/cli/src/commands/inspect/   # RENAMED from trace/
  index.ts                       # (rename trace → inspect)
  show.ts
  list.ts
  stats.ts
  score.ts
```
Parsers live in `packages/core` so they're usable from both the CLI and the programmatic SDK (`evaluate()` API).
### Claude Code session parsing specifics

Event type mapping:

| Event type | Handling |
| --- | --- |
| `type: "user"` | → `Message { role: "user", content }` |
| `type: "assistant"` | → `Message { role: "assistant", content, toolCalls }` — extract tool_use/tool_result from the content array |
| `type: "progress"` | Skip — these are hook/status events, not conversation |
| `type: "system"` | Skip — API errors, turn metrics (extract duration from here) |
| `type: "file-history-snapshot"` | Skip — file backup tracking |
Subagent sessions (`{uuid}/subagents/agent-{id}.jsonl`): skip for v1. Import only the main session. Subagent import can be added later.

Token usage: aggregate `usage` blocks from all `type: "assistant"` messages. Each has `{ input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens }`.
Duration: use timestamp delta between first and last event in the session.
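A sketch of the token and duration aggregation described above (the event shape is simplified and `summarize` is illustrative; it assumes a non-empty, chronologically ordered event list):

```typescript
interface ClaudeUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface ClaudeEvent {
  type: string;      // "user" | "assistant" | "progress" | "system" | ...
  timestamp: string; // ISO 8601
  message?: { usage?: ClaudeUsage };
}

function summarize(events: ClaudeEvent[]) {
  const usage = {
    input_tokens: 0,
    output_tokens: 0,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 0,
  };
  // Aggregate usage blocks from all assistant messages.
  for (const e of events) {
    const u = e.type === "assistant" ? e.message?.usage : undefined;
    if (!u) continue;
    usage.input_tokens += u.input_tokens;
    usage.output_tokens += u.output_tokens;
    usage.cache_creation_input_tokens += u.cache_creation_input_tokens ?? 0;
    usage.cache_read_input_tokens += u.cache_read_input_tokens ?? 0;
  }
  // Duration: timestamp delta between first and last event in the session.
  const duration_ms =
    Date.parse(events[events.length - 1].timestamp) - Date.parse(events[0].timestamp);
  return { usage, duration_ms };
}
```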
### Codex CLI session parsing specifics

| Rollout item type | Transcript handling |
| --- | --- |
| `ResponseItem` | → `Message[]` — map user messages, assistant messages, tool outputs |
| `EventMsg` (`TurnStarted`/`TurnComplete`) | Extract turn timing + token usage from `TurnComplete` |
| `TurnContext` | Extract model name, CWD, policies |
| `SessionMeta` | Extract thread_id, git info, CLI version → `source` metadata |
| `Compacted` | Skip — compaction metadata |
Tool calls: extract from `ExecCommandEnd` events within `EventMsg`. Note `stdout`/`stderr` are sanitized (cleared) — only `aggregated_output` is available.
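The per-item routing in the table might be sketched like this (the discriminants and field names here are placeholders, not the actual Codex rollout wire format):

```typescript
type RolloutItem =
  | { kind: "ResponseItem"; item: unknown }
  | { kind: "EventMsg"; event: { name: string; usage?: { input_tokens: number; output_tokens: number } } }
  | { kind: "TurnContext"; model?: string; cwd?: string }
  | { kind: "SessionMeta"; threadId?: string; cliVersion?: string }
  | { kind: "Compacted" };

interface Accumulator {
  messages: unknown[];
  tokenUsage: { input: number; output: number };
  source: { model?: string; cwd?: string; thread_id?: string; version?: string };
}

// Route each rollout item per the table: messages from ResponseItem,
// token usage from TurnComplete events, metadata from TurnContext and
// SessionMeta, and skip Compacted entries.
function ingest(acc: Accumulator, item: RolloutItem): Accumulator {
  switch (item.kind) {
    case "ResponseItem":
      acc.messages.push(item.item);
      break;
    case "EventMsg":
      if (item.event.name === "TurnComplete" && item.event.usage) {
        acc.tokenUsage.input += item.event.usage.input_tokens;
        acc.tokenUsage.output += item.event.usage.output_tokens;
      }
      break;
    case "TurnContext":
      acc.source.model = item.model;
      acc.source.cwd = item.cwd;
      break;
    case "SessionMeta":
      acc.source.thread_id = item.threadId;
      acc.source.version = item.cliVersion;
      break;
    case "Compacted":
      break; // compaction metadata — skipped
  }
  return acc;
}
```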
## Why not a provider modifier (e.g., `provider: claude-cli` + a `session:` block)?

Industry best practice separates import from evaluation: the data carries the output, not the provider, and none of the surveyed frameworks make the live provider double as a file reader.

## Terminology

| Term | Meaning |
| --- | --- |
| Session | Raw client files on disk: `~/.claude/.../*.jsonl`, `~/.codex/.../*.jsonl`, `~/.copilot/.../*.jsonl` |
| Transcript | Output of `agentv import`, input to `agentv eval --transcript` |
| Result | Output of `agentv eval` |

## Session file locations (input)

| Client | Location | Event format |
| --- | --- | --- |
| Claude Code | `~/.claude/projects/<encoded-path>/<uuid>.jsonl` | `type` field: `user`, `assistant`, `progress`, `system` |
| Codex CLI | `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` | `RolloutItem`: `ResponseItem`, `EventMsg`, `TurnContext`, `SessionMeta` |
| Copilot CLI | `~/.copilot/session-state/{id}/events.jsonl` | `session.start`, `user.message`, `assistant.message`, `tool.execution_*` |

## Transcript JSONL format (output of import)
Each line is a self-contained test case with pre-populated output (pretty-printed here; the actual file stores one JSON object per line):

```json
{
  "input": "Add JWT authentication to the Express API",
  "output": [
    { "role": "user", "content": "Add JWT authentication to the Express API" },
    {
      "role": "assistant",
      "content": "I'll add JWT auth...",
      "tool_calls": [
        { "tool": "Read", "input": { "file_path": "/src/index.ts" }, "output": "...", "duration_ms": 45 },
        { "tool": "Edit", "input": { "file_path": "/src/auth.ts", "old_string": "...", "new_string": "..." }, "duration_ms": 120 }
      ]
    }
  ],
  "token_usage": { "input": 15000, "output": 3200, "cached": 8000 },
  "duration_ms": 45000,
  "cost_usd": 0.12,
  "source": {
    "provider": "claude-cli",
    "session_id": "0763061e-9c91-4dee-9373-3bd69c817fd4",
    "model": "claude-opus-4-6",
    "version": "2.1.62",
    "timestamp": "2026-03-29T21:10:11.969Z",
    "git_branch": "main",
    "cwd": "/home/user/projects/myapp"
  }
}
```

The `output` field uses the existing `Message[]` schema from `packages/core/src/evaluation/providers/types.ts` (snake_case wire format).

## CLI: `agentv import`

Default output directory: `.agentv/transcripts/`

## Evaluation with transcripts

Example eval YAML:
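A minimal sketch of such an eval, assuming the test shape described under Design clarifications (`input` plus `assert`); the assertion entry shown is an illustrative placeholder, not the real evaluator schema:

```yaml
# Sketch — only `input` and `assert` are taken from this doc;
# the evaluator type and fields below are hypothetical.
tests:
  - input: "Add JWT authentication to the Express API"
    assert:
      - type: llm-judge
        criteria: "Adds JWT middleware and protects existing routes"
```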
## Existing code to build on
- `packages/core/src/evaluation/providers/copilot-log.ts`
- `packages/core/src/evaluation/providers/copilot-log-parser.ts`
- `packages/core/src/evaluation/providers/copilot-session-discovery.ts`
- `packages/core/src/evaluation/providers/types.ts` — `Message[]`, `ToolCall`, `ProviderResponse` — the target format
- `packages/core/src/evaluation/providers/cli.ts`
- `packages/core/src/evaluation/orchestrator.ts` — `--transcript` bypass (skip provider, use pre-populated output)
- `apps/cli/src/commands/trace/`
- `examples/showcase/offline-grader-benchmark/`
- `examples/features/copilot-log-eval/`

## Implementation order
1. Claude importer (`agentv import claude`) — highest value, primary tool, sessions already on disk
2. Orchestrator `--transcript` flag — skip provider invocation, use pre-populated output
3. Codex importer (`agentv import codex`) — enables cross-client comparison
4. `agentv trace` → `agentv inspect` rename — with deprecated alias
5. Copilot importer (`agentv import copilot`) — migrate existing `copilot-log` parser logic

## Acceptance signals
- `agentv import claude --session-id <uuid>` produces valid transcript JSONL
- `agentv import claude --discover latest --project-path <path>` auto-discovers sessions
- `agentv import codex --discover latest` produces valid transcript JSONL
- `agentv import copilot --session-id <uuid>` produces valid transcript JSONL
- `agentv eval <file> --transcript <path>` grades pre-populated transcripts without provider invocation
- `agentv compare` works across results from different transcript sources
- `agentv inspect list/show/stats/score` work identically to current `agentv trace` subcommands
- `agentv trace` still works as a deprecated alias
- Transcripts carry `source` metadata (provider, session_id, model, timestamp, git_branch)

## Non-goals
- Removing the `copilot-log` provider (keep it working, deprecate later)

## Industry research
Research across 6 eval frameworks confirms this architecture:
- `actual_output` on `LLMTestCase` — grade pre-recorded output (DeepEval)
- `evaluate()` on pre-recorded data
- `echo` provider returns cached data through the provider interface
- `eval chats to-tests` converts sessions to eval test sets

All separate import/transform from evaluation. None make the live provider double as a file reader.
## Default output naming

When `--output` is omitted, the transcript is written to the default directory `.agentv/transcripts/`, which is created automatically if it doesn't exist.

## Cost calculation from token counts
Claude Code sessions store token counts but not dollar costs. The importer should:
- Extract `usage.input_tokens`, `usage.output_tokens`, `usage.cache_creation_input_tokens`, `usage.cache_read_input_tokens` from assistant messages
- Emit `cost_usd: null` (not computed) — cost calculation requires model pricing tables, which change frequently and are out of scope
- The `cost` evaluator will work if the user provides a `cost_usd` override in the eval YAML, or will skip/fail gracefully if null

Same approach for Codex CLI (has `usage.input_tokens`/`usage.output_tokens` in `turn.completed` events).