Tracking: OpenTelemetry metrics — Phase 1 #30

@lis186

Status: In progress (solo work, no review requested). This issue exists to make the work visible — design phase complete, implementation starting. No action needed from maintainers or contributors. If you spot something concerning, feel free to comment; otherwise the next signal will be the PR.

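Summary (translated from Traditional Chinese)

  • ccxray will support pushing HTTP-layer metrics such as cost / token / tool / MCP / skill to the user's OTel backend (Grafana / Datadog / Honeycomb).
  • Off by default. Three-tier opt-in: disabled / project anonymous / personal named. Engineers can unilaterally downgrade or opt out.
  • Conversation content is never sent out; only metadata and aggregate counts are.
  • The pre-mortem is complete (11 risks; all 9 mitigations score ≥ 9/10), and the design record and OpenSpec change are both on the branch.
  • Implementation is starting; progress will show up on the otel-metrics-phase1 branch.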

What's being built

Phase 1 emits multi-dimensional metrics under the ccxray.* namespace, covering:

  • Cost: tokens.input/output/cache_read/cache_creation, cost.usd, cache_hit_ratio
  • Usage: tool / MCP / skill / session / agent_type / provider invocation counters
  • Quality: errors{type}, stop_reason{reason}, latency p50/p95, max_tokens_hit_rate
  • Patterns: context_utilization, auto_compact_triggered, subagent_ratio, tools_per_turn
  • Governance: permission_mode, dangerous_tool, file_writes, provider_distribution
  • Sentinels: cardinality overflow, parser drift, reconciliation mismatch, OTel health state

Plus shared infrastructure: server/otel.js, server/otel-health.js, server/config-loader.js, server/parsers/ (schema-ized parser layer with snapshot fixtures).
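
To make the cardinality overflow sentinel above more concrete, here is a minimal sketch of one way such a guard could work. It is not taken from the branch: `createCardinalityGuard`, the `__overflow__` label, and the cap value are all hypothetical names chosen for illustration, and the real implementation in server/otel.js may look quite different.

```javascript
// Hypothetical sentinel label; the actual label name is not given in this issue.
const OVERFLOW_LABEL = "__overflow__";

// Builds a guard that lets at most `maxDistinct` distinct attribute values
// through unchanged. Any further new value is collapsed into OVERFLOW_LABEL
// and counted, so a sentinel metric could report that overflow happened.
function createCardinalityGuard(maxDistinct) {
  const seen = new Set();
  let overflowed = 0;

  return {
    // Returns the value as-is if it is already known or there is still room;
    // otherwise returns the overflow label and bumps the overflow counter.
    normalize(value) {
      if (seen.has(value)) return value;
      if (seen.size < maxDistinct) {
        seen.add(value);
        return value;
      }
      overflowed += 1;
      return OVERFLOW_LABEL;
    },
    // Number of times a value was collapsed into the overflow label.
    get overflowCount() {
      return overflowed;
    },
  };
}
```

With a cap of 2, the first two distinct tool names pass through, a third new name is reported as `__overflow__`, and values already seen keep passing through unchanged. The design intent is that a runaway attribute (for example, an unbounded set of tool names) cannot blow up the series count on the user's backend.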

Why default OFF + tiered opt-in

The hardest design constraint: ccxray users are individual developers. A telemetry feature that defaults to ON, or that exposes individual identity, would break the implicit trust contract — and a feature that lets managers track individual tool usage would trigger a backlash that kills adoption.

The design therefore:

  • Emits nothing unless the user creates .ccxray.json or sets the env vars.
  • Uses a three-tier model: 0=disabled / 1=project anonymous / 2=personal named. The project config sets the upper bound; a personal config can only match it or downgrade. An engineer can unilaterally opt out.
  • Never lets OTel failures break ccxray: config errors fail fast at startup, init errors degrade silently, and runtime errors are absorbed by a bounded queue + circuit breaker.
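
The tier rule above can be sketched as a small pure function. This is an illustration only, under stated assumptions: `TIER`, `isValidTier`, and `resolveTier` are hypothetical names, a missing project config is assumed to mean an upper bound of 0, a missing personal config is assumed to inherit the project tier, and a personal tier above the project tier is assumed to be clamped rather than rejected. The actual behavior is defined by the OpenSpec change on the branch.

```javascript
// Hypothetical constants mirroring the 0 / 1 / 2 tier model described above.
const TIER = Object.freeze({
  DISABLED: 0,
  PROJECT_ANONYMOUS: 1,
  PERSONAL_NAMED: 2,
});

// True only for the three known integer tier values.
function isValidTier(value) {
  return (
    Number.isInteger(value) &&
    value >= TIER.DISABLED &&
    value <= TIER.PERSONAL_NAMED
  );
}

// Resolves the effective tier from the project and personal settings.
// Either argument may be undefined when that config is absent.
function resolveTier(projectTier, personalTier) {
  // Assumption: with no project config, nothing may be emitted.
  const upperBound = projectTier === undefined ? TIER.DISABLED : projectTier;
  if (!isValidTier(upperBound)) {
    // Config errors fail fast, as the issue describes for startup.
    throw new RangeError(`invalid project tier: ${String(projectTier)}`);
  }

  // Assumption: no personal config means the project tier applies as-is.
  if (personalTier === undefined) return upperBound;
  if (!isValidTier(personalTier)) {
    throw new RangeError(`invalid personal tier: ${String(personalTier)}`);
  }

  // Personal config can only equal or downgrade the project tier,
  // so a higher personal value is clamped to the upper bound.
  return Math.min(upperBound, personalTier);
}
```

For example, `resolveTier(2, 0)` yields 0 (the engineer opts out unilaterally), while `resolveTier(1, 2)` yields 1 because the project tier caps the result.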

What's NOT being built (Phase 2 follow-up)

  • Span emit (traces) — Phase 1 is metrics only
  • ccxray.entry_id / dashboard_url attributes and /entry/:id drill-back route
  • Full payload export — that conflicts with ccxray's local-first design

Where the design lives

Branch: otel-metrics-phase1

| Artifact | Location |
| --- | --- |
| Design record + 11-risk pre-mortem + 3-plan comparison | docs/otel-integration.html |
| Visual walkthrough (pure SVG, no external deps) | docs/otel-phase1-overview.html |
| OpenSpec change (proposal / design / 6 capability specs / tasks) | openspec/changes/add-otel-metrics-phase1/ |

Every claim in the visual walkthrough is cited back to a specific spec section.

Tracking

Implementation will land as additive commits on otel-metrics-phase1. The eventual PR will be a separate, focused review request. No interim review needed.

Estimated Phase 1 scope per tasks.md: ~11 task groups, 60+ checkboxes, all gated behind opt-in defaults so existing ccxray users see zero behavior change.
