feat: add core_failure telemetry with PII-safe input signatures (#245)

suryaiyer95 · claude · web-flow · commit 9aa1d399cda2 · 2026-03-18T22:33:01.000-07:00
* feat: add `core_failure` telemetry with PII-safe masking

Add a new `core_failure` event emitted on both soft failures
(`metadata.success === false`) and uncaught tool exceptions, with
privacy-preserving context for debugging:

- `classifyError()` — keyword-based error classification (parse, connection, timeout, validation, permission, internal, unknown)
- `computeInputSignature()` — records key names + value types/lengths, never actual values; truncates by dropping keys to preserve valid JSON
- `maskArgs()` — PII masking aligned to Rust SDK: 19 sensitive keys redacted, string literals in SQL replaced with `?`, recursive object traversal
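
As a rough illustration of the signature idea, the sketch below records only key names plus value types and sizes. It is a simplified stand-in for `computeInputSignature()`: the name `inputSignatureSketch` is hypothetical, and the 1000-character truncation step is left out.

```typescript
// Simplified sketch: describe each argument by type and size, never by value.
function inputSignatureSketch(args: Record<string, unknown>): string {
  const sig: Record<string, string> = {}
  for (const [k, v] of Object.entries(args)) {
    if (v === null || v === undefined) sig[k] = "null"
    else if (typeof v === "string") sig[k] = `string:${v.length}`
    else if (Array.isArray(v)) sig[k] = `array:${v.length}`
    else if (typeof v === "object") sig[k] = `object:${Object.keys(v).length}`
    else sig[k] = typeof v // number, boolean, and other primitives
  }
  return JSON.stringify(sig)
}

const signature = inputSignatureSketch({
  sql: "SELECT * FROM users WHERE email = 'john@email.com'",
  limit: 100,
  tags: ["a", "b"],
  options: null,
})
// The result names the keys and their shapes, but the email address never appears.
console.log(signature)
```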

Telemetry is fully isolated from tool execution — all tracking calls
are wrapped in `try/catch` so telemetry failures never break tools.
`Truncate.output()` runs outside the telemetry error boundary so I/O
errors aren't misattributed as tool failures.
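
The isolation pattern can be sketched as follows. `safeTrack`, `brokenTrack`, and `runTool` are hypothetical names used only for illustration; the commit itself inlines `try/catch` around each `Telemetry.track()` call rather than using a helper.

```typescript
type TrackFn = (event: Record<string, unknown>) => void

// Hypothetical wrapper: any error thrown by the tracker is swallowed here.
function safeTrack(track: TrackFn, event: Record<string, unknown>): void {
  try {
    track(event)
  } catch {
    // Telemetry must never break the tool, so the failure is dropped on purpose.
  }
}

// A tracker that always throws stands in for a broken telemetry pipeline.
const brokenTrack: TrackFn = () => {
  throw new Error("telemetry backend unavailable")
}

function runTool(): string {
  const output = "tool output"
  safeTrack(brokenTrack, { type: "core_failure", tool_name: "example" })
  return output // still returned even though tracking failed
}

const toolResult = runTool()
console.log(toolResult)
```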

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add `skill_used` telemetry event

Tracks which skill is loaded and where it came from (`builtin`, `global`,
or `project`) with duration. Wrapped in try/catch — cannot break skill
loading. Docs table updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add `sql_execute_failure` telemetry for SQL execution errors

`core_failure` is for internal tool failures. SQL execution via the
dispatcher is a separate concern: soft errors are returned as results
(not thrown), so `core_failure` never fires for them.

New `sql_execute_failure` event captures: warehouse type, query type,
error message (truncated to 500 chars), and PII-masked SQL. Fires from
the `sql.execute` handler catch path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add persistent machine ID from `~/.altimate/machine-id`

Generated once as a random UUID and stored at `~/.altimate/machine-id`
(alongside `altimate.json`, `connections.json`, etc.). Sent as
`machine_id` in `customDimensions` on every App Insights event.
No PII — pure random UUID, never tied to user identity.
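
A minimal sketch of the generate-once behavior, assuming storage is abstracted behind a hypothetical `IdStore` interface so the example runs without touching the filesystem. The commit itself reads and writes the file with `fs` and generates the ID with `randomUUID()`.

```typescript
// Hypothetical storage abstraction; the commit uses fs.readFileSync / fs.writeFileSync.
interface IdStore {
  read(): string | undefined // undefined when no ID has been stored yet
  write(id: string): void
}

function loadOrCreateMachineId(store: IdStore, generate: () => string): string {
  const existing = store.read()
  if (existing !== undefined) return existing.trim()
  const fresh = generate()
  store.write(fresh)
  return fresh
}

// In-memory store standing in for ~/.altimate/machine-id.
let stored: string | undefined
const memoryStore: IdStore = {
  read: () => stored,
  write: (id) => {
    stored = id
  },
}

let generatedCount = 0
const fakeUuid = () => `id-${++generatedCount}` // stand-in for randomUUID()

const firstRun = loadOrCreateMachineId(memoryStore, fakeUuid)
const secondRun = loadOrCreateMachineId(memoryStore, fakeUuid)
console.log(firstRun === secondRun) // true: the ID is generated once and reused
```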

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: correct `masked_sql` field and `ERROR_PATTERNS` ordering in telemetry

- `sql_execute_failure`: use `Telemetry.maskString(params.sql)` instead of
  `Telemetry.maskArgs({ sql: params.sql })` — the latter serializes a JSON
  object string `{"sql":"..."}` rather than the raw masked SQL
- `ERROR_PATTERNS`: move `permission` before `validation` so errors like
  "Invalid permission denied" are not misclassified as `validation`

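Because classification returns the first pattern whose keyword matches, the order of the pattern list decides the outcome for messages that hit several classes. A small sketch using only the two affected entries (the names `ErrorClassSketch`, `patternsSketch`, and `classifySketch` are hypothetical):

```typescript
type ErrorClassSketch = "permission" | "validation" | "unknown"

// Subset of the pattern list, in the corrected order (permission first).
const patternsSketch: Array<{ class: ErrorClassSketch; keywords: string[] }> = [
  { class: "permission", keywords: ["permission", "denied", "unauthorized", "forbidden"] },
  { class: "validation", keywords: ["invalid params", "invalid", "missing", "required"] },
]

// First match wins, so earlier entries take priority.
function classifySketch(message: string, order = patternsSketch): ErrorClassSketch {
  const lower = message.toLowerCase()
  for (const { class: cls, keywords } of order) {
    if (keywords.some((kw) => lower.includes(kw))) return cls
  }
  return "unknown"
}

const fixedOrder = classifySketch("Invalid permission denied")
const oldOrder = classifySketch("Invalid permission denied", [...patternsSketch].reverse())
console.log(fixedOrder) // "permission"
console.log(oldOrder) // "validation", the misclassification this fix removes
```
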
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf: skip success `tool_call` telemetry for file tools

Read/write/edit/glob/grep/bash succeed constantly in normal operation, so
tracking every success is high-volume noise with no actionable signal.
Failures (hard throws and soft failures) are still fully captured via
`tool_call` (status=error) and `core_failure`.
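
The filtering rule can be sketched as a small predicate. `shouldTrackToolCall` and `FILE_TOOLS_SKETCH` are hypothetical names, and the exact condition in `tool.ts` may differ; only the set of file tools is taken from this change.

```typescript
const FILE_TOOLS_SKETCH = new Set(["read", "write", "edit", "glob", "grep", "bash"])

// Assumed rule: errors are always tracked, successes only for non-file tools.
function shouldTrackToolCall(toolName: string, status: "success" | "error"): boolean {
  if (status === "error") return true
  return !FILE_TOOLS_SKETCH.has(toolName)
}

console.log(shouldTrackToolCall("read", "success")) // false: high-volume noise
console.log(shouldTrackToolCall("read", "error")) // true: failures are kept
console.log(shouldTrackToolCall("skill", "success")) // true: not a file tool
```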

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: clarify `core_failure` event description in telemetry docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: simplify `core_failure` description

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: mask error messages before sending to telemetry

Error messages from SQL engines can embed data values (e.g.
"Value 'john@email.com' does not match type INTEGER"). Apply
maskString() to all error_message fields before transmission,
consistent with how args are already masked.

Affects: core_failure (tool.ts), sql_execute_failure (register.ts)
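
For illustration, here is the effect on the example message above. The function body mirrors the final `maskString` in this change (which also covers double-quoted literals, added in a later fix); the sample strings are made up.

```typescript
// Replace quoted literals with "?" and collapse whitespace.
function maskStringSketch(s: string): string {
  return s
    .replace(/'(?:[^'\\]|\\.)*'/g, "?") // single-quoted literals
    .replace(/"(?:[^"\\]|\\.)*"/g, "?") // double-quoted literals
    .replace(/\s+/g, " ")
    .trim()
}

const rawError = "Value 'john@email.com' does not match type INTEGER"
// Mask first, then truncate, matching the 500-character limit used for error messages.
const maskedError = maskStringSketch(rawError).slice(0, 500)
console.log(maskedError) // "Value ? does not match type INTEGER"

const maskedQuoted = maskStringSketch('column "Jane Doe"   not found')
console.log(maskedQuoted) // "column ? not found"
```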

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: security hardening for telemetry PII safety

- Mask error messages in `native_call` (dispatcher.ts) and `warehouse_connect` (registry.ts) — these were sending raw error strings that could embed credentials or query fragments
- Fix soft-failure `error_message` fallback: drop `result.output` as a source (raw tool output could contain file contents or secrets); fall back to `"unknown error"` instead
- Strip `_retried` internal flag from App Insights payload — was leaking into `properties` on retried events
- Add camelCase variants to `SENSITIVE_KEYS` (`authToken`, `bearerToken`, `jwtSecret`, etc.) — underscore prefix/suffix matching missed these
- Document `machine_id` in telemetry privacy docs
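
To see why the camelCase entries are needed, consider the matching rule on its own: a key is sensitive if it equals a listed name or carries it as an underscore-delimited prefix or suffix. The sketch below uses a shortened key list (`sensitiveKeysSketch` and `isSensitiveKeySketch` are hypothetical names).

```typescript
// Shortened list for illustration; the real SENSITIVE_KEYS list is longer.
const sensitiveKeysSketch = ["token", "password", "secret", "authtoken", "jwtsecret"]

function isSensitiveKeySketch(key: string, keys: string[] = sensitiveKeysSketch): boolean {
  const lower = key.toLowerCase()
  return keys.some((k) => lower === k || lower.endsWith(`_${k}`) || lower.startsWith(`${k}_`))
}

console.log(isSensitiveKeySketch("access_token")) // true: matches the "_token" suffix
console.log(isSensitiveKeySketch("authToken")) // true: "authtoken" is listed explicitly
console.log(isSensitiveKeySketch("authToken", ["token"])) // false: no underscore to match on
console.log(isSensitiveKeySketch("table_name")) // false
```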

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address major review findings in telemetry PII masking

- Extend `maskString` to also mask double-quoted strings (`"John"`, `$$secret$$`-adjacent) — single-quoted-only regex was flagged as PII leak
- Keep `connection` in `ERROR_PATTERNS` keywords (broad but intentional)
- Truncate `masked_sql` to 2000 chars before sending — was unbounded unlike `error_message` (500) and `masked_args` (2000)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: update `core_failure` event description in telemetry reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add altimate_change markers to upstream-shared tool files

Wrap all telemetry additions in `packages/opencode/src/tool/tool.ts`
and `packages/opencode/src/tool/skill.ts` with `// altimate_change
start/end` markers so the upstream marker-guard CI passes.

- `tool.ts`: markers around `import { Telemetry }` and the full
  telemetry instrumentation block (startTime through soft-failure
  core_failure emission)
- `skill.ts`: markers around `classifySkillSource` helper, `startTime`
  declaration, and the `Telemetry.track` try-catch for `skill_used`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/docs/docs/reference/telemetry.md b/docs/docs/reference/telemetry.md
@@ -11,9 +11,9 @@ We collect the following categories of events:
 | `session_start` | A new CLI session begins |
 | `session_end` | A CLI session ends (includes duration) |
 | `session_forked` | A session is forked from an existing one |
-| `generation` | An AI model generation completes (model ID, token counts, duration, but no prompt content) |
-| `tool_call` | A tool is invoked (tool name and category, but no arguments or output) |
-| `bridge_call` | A native tool call completes (method name and duration, but no arguments) |
+| `generation` | An AI model generation completes (model ID, token counts, duration — no prompt content) |
+| `tool_call` | A tool is invoked (tool name and category — no arguments or output) |
+| `native_call` | A native engine call completes (method name and duration — no arguments) |
 | `command` | A CLI command is executed (command name only) |
 | `error` | An unhandled error occurs (error type and truncated message, but no stack traces) |
 | `auth_login` | Authentication succeeds or fails (provider and method, but no credentials) |
@@ -33,8 +33,11 @@ We collect the following categories of events:
 | `error_recovered` | Successful recovery from a transient error (error type, strategy, attempt count) |
 | `mcp_server_census` | MCP server capabilities after connect (tool and resource counts, but no tool names) |
 | `context_overflow_recovered` | Context overflow is handled (strategy) |
+| `skill_used` | A skill is loaded (skill name and source — `builtin`, `global`, or `project` — no skill content) |
+| `sql_execute_failure` | A SQL execution fails (warehouse type, query type, error message, PII-masked SQL — no raw values) |
+| `core_failure` | An internal tool error occurs (tool name, category, error class, truncated error message, PII-safe input signature, and optionally masked arguments — no raw values or credentials) |
 
-Each event includes a timestamp, anonymous session ID, and the CLI version.
+Each event includes a timestamp, anonymous session ID, CLI version, and an anonymous machine ID (a random UUID stored in `~/.altimate/machine-id`, generated once and never tied to any personal information).
 
 ## Delivery & Reliability
 
@@ -113,9 +116,9 @@ Event type names use **snake_case** with a `domain_action` pattern:
 
 ### Adding a New Event
 
-1. **Define the type.** Add a new variant to the `Telemetry.Event` union in `packages/altimate-code/src/telemetry/index.ts`
-2. **Emit the event.** Call `Telemetry.track()` at the appropriate location
-3. **Update docs.** Add a row to the event table above
+1. **Define the type** — Add a new variant to the `Telemetry.Event` union in `packages/opencode/src/altimate/telemetry/index.ts`
+2. **Emit the event** — Call `Telemetry.track()` at the appropriate location
+3. **Update docs** — Add a row to the event table above
 
 ### Privacy Checklist
 
diff --git a/packages/drivers/src/sqlserver.ts b/packages/drivers/src/sqlserver.ts
@@ -7,7 +7,6 @@ import type { ConnectionConfig, Connector, ConnectorResult, SchemaColumn } from
 export async function connect(config: ConnectionConfig): Promise<Connector> {
   let mssql: any
   try {
-    // @ts-expect-error — optional dependency, loaded at runtime
     mssql = await import("mssql")
     mssql = mssql.default || mssql
   } catch {
diff --git a/packages/opencode/src/altimate/native/connections/register.ts b/packages/opencode/src/altimate/native/connections/register.ts
@@ -228,6 +228,8 @@ register("sql.execute", async (params: SqlExecuteParams): Promise<SqlExecuteResu
     } catch {}
     return result
   } catch (e) {
+    const errorMsg = String(e)
+    const maskedErrorMsg = Telemetry.maskString(errorMsg).slice(0, 500)
     try {
       Telemetry.track({
         type: "warehouse_query",
@@ -239,11 +241,21 @@ register("sql.execute", async (params: SqlExecuteParams): Promise<SqlExecuteResu
         duration_ms: Date.now() - startTime,
         row_count: 0,
         truncated: false,
-        error: String(e).slice(0, 500),
+        error: maskedErrorMsg,
         error_category: categorizeQueryError(e),
       })
+      Telemetry.track({
+        type: "sql_execute_failure",
+        timestamp: Date.now(),
+        session_id: Telemetry.getContext().sessionId,
+        warehouse_type: warehouseType,
+        query_type: detectQueryType(params.sql),
+        error_message: maskedErrorMsg,
+        masked_sql: Telemetry.maskString(params.sql).slice(0, 2000),
+        duration_ms: Date.now() - startTime,
+      })
     } catch {}
-    return { columns: [], rows: [], row_count: 0, truncated: false, error: String(e) } as SqlExecuteResult & { error: string }
+    return { columns: [], rows: [], row_count: 0, truncated: false, error: errorMsg } as SqlExecuteResult & { error: string }
   }
 })
 
diff --git a/packages/opencode/src/altimate/native/connections/registry.ts b/packages/opencode/src/altimate/native/connections/registry.ts
@@ -291,7 +291,7 @@ export async function get(name: string): Promise<Connector> {
           auth_method: detectAuthMethod(config),
           success: false,
           duration_ms: Date.now() - startTime,
-          error: String(e).slice(0, 500),
+          error: Telemetry.maskString(String(e)).slice(0, 500),
           error_category: categorizeConnectionError(e),
         })
       } catch {}
diff --git a/packages/opencode/src/altimate/native/dispatcher.ts b/packages/opencode/src/altimate/native/dispatcher.ts
@@ -75,7 +75,7 @@ export async function call<M extends BridgeMethod>(
         method: method as string,
         status: "error",
         duration_ms: Date.now() - startTime,
-        error: String(e).slice(0, 500),
+        error: Telemetry.maskString(String(e)).slice(0, 500),
       })
     } catch {
       // Telemetry must never prevent error propagation
diff --git a/packages/opencode/src/altimate/telemetry/index.ts b/packages/opencode/src/altimate/telemetry/index.ts
@@ -2,7 +2,10 @@ import { Account } from "@/account"
 import { Config } from "@/config/config"
 import { Installation } from "@/installation"
 import { Log } from "@/util/log"
-import { createHash } from "crypto"
+import { createHash, randomUUID } from "crypto"
+import fs from "fs"
+import path from "path"
+import os from "os"
 
 const log = Log.create({ service: "telemetry" })
 
@@ -63,6 +66,7 @@ export namespace Telemetry {
         duration_ms: number
         sequence_index: number
         previous_tool: string | null
+        input_signature?: string
         error?: string
       }
     | {
@@ -331,6 +335,166 @@ export namespace Telemetry {
         has_ssh_tunnel: boolean
         has_keychain: boolean
       }
+    | {
+        type: "skill_used"
+        timestamp: number
+        session_id: string
+        message_id: string
+        skill_name: string
+        skill_source: "builtin" | "global" | "project"
+        duration_ms: number
+      }
+    | {
+        type: "sql_execute_failure"
+        timestamp: number
+        session_id: string
+        warehouse_type: string
+        query_type: string
+        error_message: string
+        masked_sql: string
+        duration_ms: number
+      }
+    | {
+        type: "core_failure"
+        timestamp: number
+        session_id: string
+        tool_name: string
+        tool_category: string
+        error_class:
+          | "parse_error"
+          | "connection"
+          | "timeout"
+          | "validation"
+          | "internal"
+          | "permission"
+          | "unknown"
+        error_message: string
+        input_signature: string
+        masked_args?: string
+        duration_ms: number
+      }
+
+  const ERROR_PATTERNS: Array<{
+    class: Telemetry.Event & { type: "core_failure" } extends { error_class: infer C } ? C : never
+    keywords: string[]
+  }> = [
+    { class: "parse_error", keywords: ["parse", "syntax", "binder", "unexpected token", "sqlglot"] },
+    {
+      class: "connection",
+      keywords: ["econnrefused", "connection", "socket", "enotfound", "econnreset"],
+    },
+    { class: "timeout", keywords: ["timeout", "etimedout", "bridge timeout", "timed out"] },
+    { class: "permission", keywords: ["permission", "denied", "unauthorized", "forbidden"] },
+    { class: "validation", keywords: ["invalid params", "invalid", "missing", "required"] },
+    { class: "internal", keywords: ["internal", "assertion"] },
+  ]
+
+  export function classifyError(
+    message: string,
+  ): Telemetry.Event & { type: "core_failure" } extends { error_class: infer C } ? C : never {
+    const lower = message.toLowerCase()
+    for (const { class: cls, keywords } of ERROR_PATTERNS) {
+      if (keywords.some((kw) => lower.includes(kw))) return cls
+    }
+    return "unknown"
+  }
+
+  export function computeInputSignature(args: Record<string, unknown>): string {
+    const sig: Record<string, string> = {}
+    for (const [k, v] of Object.entries(args)) {
+      if (v === null || v === undefined) {
+        sig[k] = "null"
+      } else if (typeof v === "string") {
+        sig[k] = `string:${v.length}`
+      } else if (typeof v === "number") {
+        sig[k] = "number"
+      } else if (typeof v === "boolean") {
+        sig[k] = "boolean"
+      } else if (Array.isArray(v)) {
+        sig[k] = `array:${v.length}`
+      } else if (typeof v === "object") {
+        sig[k] = `object:${Object.keys(v).length}`
+      } else {
+        sig[k] = typeof v
+      }
+    }
+    const result = JSON.stringify(sig)
+    if (result.length <= 1000) return result
+    // Drop keys from the end until the JSON fits, preserving valid JSON structure
+    const keys = Object.keys(sig)
+    while (keys.length > 0) {
+      keys.pop()
+      const truncated: Record<string, string> = {}
+      for (const k of keys) truncated[k] = sig[k]
+      truncated["..."] = `${Object.keys(sig).length - keys.length} more`
+      const out = JSON.stringify(truncated)
+      if (out.length <= 1000) return out
+    }
+    return JSON.stringify({ "...": `${Object.keys(sig).length} keys` })
+  }
+
+  // Mirrors altimate-sdk (Rust) SENSITIVE_KEYS — keep in sync.
+  const SENSITIVE_KEYS: string[] = [
+    "key", "api_key", "apikey", "apiKey", "token", "access_token", "refresh_token",
+    "secret", "secret_key", "password", "passwd", "pwd",
+    "credential", "credentials", "authorization", "auth",
+    "signature", "sig", "private_key", "connection_string",
+    // camelCase variants not caught by prefix/suffix matching
+    "authtoken", "accesstoken", "refreshtoken", "bearertoken", "jwttoken",
+    "jwtsecret", "clientsecret", "appsecret",
+  ]
+
+  function isSensitiveKey(key: string): boolean {
+    const lower = key.toLowerCase()
+    return SENSITIVE_KEYS.some(
+      (k) => lower === k || lower.endsWith(`_${k}`) || lower.startsWith(`${k}_`),
+    )
+  }
+
+  export function maskString(s: string): string {
+    return s
+      .replace(/'(?:[^'\\]|\\.)*'/g, "?")
+      .replace(/"(?:[^"\\]|\\.)*"/g, "?")
+      .replace(/\s+/g, " ")
+      .trim()
+  }
+
+  function maskValue(value: unknown, key?: string): unknown {
+    if (key && isSensitiveKey(key)) return "****"
+    if (typeof value === "string") return maskString(value)
+    if (Array.isArray(value)) return value.map((v) => maskValue(v, key))
+    if (value !== null && typeof value === "object") {
+      const masked: Record<string, unknown> = {}
+      for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
+        masked[k] = maskValue(v, k)
+      }
+      return masked
+    }
+    return value
+  }
+
+  /** PII-mask tool arguments for failure telemetry.
+   *  Mirrors altimate-sdk mask_value: sensitive keys → "****",
+   *  string literals in SQL → ?, whitespace collapsed. Truncates to 2000 chars. */
+  export function maskArgs(args: Record<string, unknown>): string {
+    const masked: Record<string, unknown> = {}
+    for (const [k, v] of Object.entries(args)) {
+      masked[k] = maskValue(v, k)
+    }
+    const result = JSON.stringify(masked)
+    if (result.length <= 2000) return result
+    // Drop keys from the end until valid JSON fits, same approach as computeInputSignature
+    const keys = Object.keys(masked)
+    while (keys.length > 0) {
+      keys.pop()
+      const truncated: Record<string, unknown> = {}
+      for (const k of keys) truncated[k] = masked[k]
+      truncated["..."] = `${Object.keys(masked).length - keys.length} more`
+      const out = JSON.stringify(truncated)
+      if (out.length <= 2000) return out
+    }
+    return JSON.stringify({ "...": `${Object.keys(masked).length} keys` })
+  }
 
   const FILE_TOOLS = new Set(["read", "write", "edit", "glob", "grep", "bash"])
 
@@ -373,6 +537,7 @@ export namespace Telemetry {
   let buffer: Event[] = []
   let flushTimer: ReturnType<typeof setInterval> | undefined
   let userEmail = ""
+  let machineId = ""
   let sessionId = ""
   let projectId = ""
   let appInsights: AppInsightsConfig | undefined
@@ -402,12 +567,13 @@ export namespace Telemetry {
       const properties: Record<string, string> = {
         cli_version: Installation.VERSION,
         project_id: fields.project_id ?? projectId,
+        ...(machineId && { machine_id: machineId }),
       }
       const measurements: Record<string, number> = {}
 
       // Flatten all fields — nested `tokens` object gets prefixed keys
       for (const [k, v] of Object.entries(fields)) {
-        if (k === "session_id" || k === "project_id") continue
+        if (k === "session_id" || k === "project_id" || k === "_retried") continue
         if (k === "tokens" && typeof v === "object" && v !== null) {
           for (const [tk, tv] of Object.entries(v as Record<string, unknown>)) {
             if (typeof tv === "number") measurements[`tokens_${tk}`] = tv
@@ -490,6 +656,18 @@ export namespace Telemetry {
       } catch {
         // Account unavailable — proceed without user ID
       }
+      try {
+        const machineIdPath = path.join(os.homedir(), ".altimate", "machine-id")
+        try {
+          machineId = fs.readFileSync(machineIdPath, "utf8").trim()
+        } catch {
+          machineId = randomUUID()
+          fs.mkdirSync(path.dirname(machineIdPath), { recursive: true })
+          fs.writeFileSync(machineIdPath, machineId, "utf8")
+        }
+      } catch {
+        // Machine ID unavailable — proceed without it
+      }
       enabled = true
       log.info("telemetry initialized", { mode: "appinsights" })
       const timer = setInterval(flush, FLUSH_INTERVAL_MS)
@@ -591,6 +769,7 @@ export namespace Telemetry {
     droppedEvents = 0
     sessionId = ""
     projectId = ""
+    machineId = ""
     initPromise = undefined
     initDone = false
   }
diff --git a/packages/opencode/src/tool/skill.ts b/packages/opencode/src/tool/skill.ts
@@ -9,8 +9,18 @@ import { iife } from "@/util/iife"
 import { Fingerprint } from "../altimate/fingerprint"
 import { Config } from "../config/config"
 import { selectSkillsWithLLM } from "../altimate/skill-selector"
+import { Telemetry } from "../altimate/telemetry"
+import os from "os"
 
 const MAX_DISPLAY_SKILLS = 50
+
+// altimate_change start — classifySkillSource helper for skill telemetry
+function classifySkillSource(location: string): "builtin" | "global" | "project" {
+  if (location.includes("node_modules") || location.includes(".altimate/builtin")) return "builtin"
+  if (location.startsWith(os.homedir())) return "global"
+  return "project"
+}
+// altimate_change end
 // altimate_change end
 
 export const SkillTool = Tool.define("skill", async (ctx) => {
@@ -83,6 +93,9 @@ export const SkillTool = Tool.define("skill", async (ctx) => {
     description,
     parameters,
     async execute(params: z.infer<typeof parameters>, ctx) {
+      // altimate_change start — telemetry: startTime for skill_used duration
+      const startTime = Date.now()
+      // altimate_change end
       // altimate_change start - use upstream Skill.get() for exact name lookup
       const skill = await Skill.get(params.name)
 
@@ -122,6 +135,22 @@ export const SkillTool = Tool.define("skill", async (ctx) => {
         return arr
       }).then((f) => f.map((file) => `<file>${file}</file>`).join("\n"))
 
+      // altimate_change start — telemetry instrumentation for skill loading
+      try {
+        Telemetry.track({
+          type: "skill_used",
+          timestamp: Date.now(),
+          session_id: ctx.sessionID,
+          message_id: ctx.messageID,
+          skill_name: skill.name,
+          skill_source: classifySkillSource(skill.location),
+          duration_ms: Date.now() - startTime,
+        })
+      } catch {
+        // Telemetry must never break skill loading
+      }
+      // altimate_change end
+
       return {
         title: `Loaded skill: ${skill.name}`,
         output: [
diff --git a/packages/opencode/src/tool/tool.ts b/packages/opencode/src/tool/tool.ts
diff --git a/packages/opencode/test/telemetry/telemetry.test.ts b/packages/opencode/test/telemetry/telemetry.test.ts