Universal control plane: live queue + full-trace observability/audit that every interaction flows through #896

## Vision (Jay 2026-06-14)
One control plane where you can SEE and CONTROL everything taOS does. **Every human and agent interaction goes through it** -- chat turns, agent runs, tool calls, image gen, memory ops, model requests. It is the observability + audit + governance spine, with live queue control on top.

### Must-haves
- **Live queue view**: see the scheduler queue in real time (what is queued / running / draining, depth, per-resource).
- **On-the-fly control**: redirect a request to a different resource/backend mid-flight; throttle the queue (insert a delay between tasks); pause/resume; cancel.
- **Full trace per task**: the messages, logs, tool calls, WHICH USER requested it, what model + provider, timing, and the agent's reasoning/decision context (not just prompts/tokens/latency).
- **Modern, organised UI**: clean live list; double-click a task for the full drill-down (timeline, tool-call tree, inputs/outputs, model/provider/user, cost, errors).
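As a rough illustration of what "full trace per task" could carry, here is a minimal sketch of a per-task record. Every name in it (`TaskTrace`, `ToolCall`, `decision_context`, `count_tool_calls`, the `done`/`failed` states) is hypothetical and is not taken from `trace_store.py` or `history_store.py`; the real schema is to be settled in the spec.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One node in the tool-call tree shown in the drill-down (hypothetical shape)."""
    name: str
    inputs: dict[str, Any]
    output: Any = None
    error: Optional[str] = None
    children: list["ToolCall"] = field(default_factory=list)  # nested calls


@dataclass
class TaskTrace:
    """Everything the drill-down needs for one task (hypothetical shape)."""
    task_id: str
    user: str          # which user requested it
    kind: str          # e.g. "chat", "agent_run", "image_gen"
    model: str
    provider: str
    state: str = "queued"  # queued / running / draining; "done"/"failed" are assumed
    started_at: Optional[float] = None
    finished_at: Optional[float] = None
    messages: list[dict[str, Any]] = field(default_factory=list)
    logs: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    decision_context: list[str] = field(default_factory=list)  # agent reasoning notes

    def duration(self) -> Optional[float]:
        """Run time in seconds, or None while the task has not finished."""
        if self.started_at is None or self.finished_at is None:
            return None
        return self.finished_at - self.started_at


def count_tool_calls(calls: list[ToolCall]) -> int:
    """Count every node in a tool-call tree, including nested calls."""
    return sum(1 + count_tool_calls(c.children) for c in calls)
```

Keeping tool calls as a tree rather than a flat list is one way to feed the "tool-call tree" view directly; a flat list with parent ids would work as well.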

## Build on what exists (do NOT rebuild)
- `tinyagentos/scheduler/history_store.py` -- the queue's dispatch history (already feeds the Activity app).
- `tinyagentos/scheduler/scheduler.py` -- the dispatcher (Phase 1; queue control = the unbuilt Phase 2, see #894).
- `tinyagentos/trace_store.py` -- trace storage.
- `tinyagentos/otel/receiver.py` + `emitter.py` + `judge.py` -- the OTLP push/receive layer the agent-audit design specified.
- `desktop/src/apps/ActivityApp.tsx` -- the existing Activity UI to evolve into this control plane.

## The mandate (universal coverage)
A single instrumentation point that every interaction passes through, so there is no blind path:
- chat (taos_agent)
- agent runs (deployed frameworks via the adapters)
- tool calls (skill_exec)
- image gen (scheduler -> backends)
- memory ops
- model/provider requests (LiteLLM)

This universal coverage is what unlocks the audit value the agent community wants:
- catch 'silent success' (agent reasons wrongly but the run looks green)
- answer 'why did the agent do that?'
- per-user attribution
- PII redaction
- audit trails that stay inside the self-hosted perimeter
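One possible shape for the single instrumentation point is a context manager that every code path wraps its work in. This is only a sketch: `record`, `EVENTS`, and the event fields are hypothetical names, and the in-memory list stands in for whatever sink the converged design ends up using (trace store, OTLP emitter).

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator

# Stand-in sink for this sketch; a real build would hand events to the trace layer.
EVENTS: list[dict[str, Any]] = []


@contextmanager
def record(kind: str, user: str, **attrs: Any) -> Iterator[dict[str, Any]]:
    """Wrap one interaction so it is always recorded, whether it succeeds or fails."""
    event: dict[str, Any] = {
        "kind": kind,        # e.g. "chat", "tool_call", "image_gen"
        "user": user,        # per-user attribution
        "attrs": dict(attrs),
        "status": "ok",
    }
    start = time.monotonic()
    try:
        yield event          # the caller may attach more attrs while running
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise                # never swallow the failure, only observe it
    finally:
        event["duration_s"] = time.monotonic() - start
        EVENTS.append(event)
```

A caller would write `with record("tool_call", user="alice", tool="search") as ev:` around the work and add to `ev["attrs"]` as it goes. Because the event is appended in `finally`, a failing path still leaves a record, which is the "no blind path" property; catching 'silent success' would additionally need the decision context attached to the event, which this sketch does not model.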

## Control surface (new, Phase 2)
- Redirect: re-target a queued/running task's resource (ties to the scheduler resource model + #894 admission).
- Throttle: configurable inter-task delay / rate limit per resource or globally (back-pressure knob).
- These need the scheduler to own a real queue first (#894).
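To make the control verbs concrete, here is a minimal single-threaded sketch of a queue that supports redirect, throttle, pause/resume, and cancel. It is not the #894 design: `ControlledQueue`, `Task`, and all method names are hypothetical, and it only redirects tasks that are still queued. Re-targeting a task that is already running is a harder problem that this sketch leaves out.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    task_id: str
    resource: str          # backend/resource the task is targeted at
    state: str = "queued"


class ControlledQueue:
    """FIFO queue with operator controls (hypothetical, illustration only)."""

    def __init__(self, sleep: Callable[[float], None] = time.sleep) -> None:
        self._tasks: deque[Task] = deque()
        self._sleep = sleep    # injectable so the throttle can be tested
        self._paused = False
        self._delay_s = 0.0

    def submit(self, task: Task) -> None:
        self._tasks.append(task)

    def pause(self) -> None:
        self._paused = True

    def resume(self) -> None:
        self._paused = False

    def set_delay(self, seconds: float) -> None:
        """Throttle: wait this long before each dispatch."""
        self._delay_s = max(0.0, seconds)

    def _find_queued(self, task_id: str) -> Optional[Task]:
        for task in self._tasks:
            if task.task_id == task_id and task.state == "queued":
                return task
        return None

    def redirect(self, task_id: str, resource: str) -> bool:
        """Re-target a still-queued task; returns False if it is not queued."""
        task = self._find_queued(task_id)
        if task is None:
            return False
        task.resource = resource
        return True

    def cancel(self, task_id: str) -> bool:
        """Mark a still-queued task cancelled; it is skipped at dispatch time."""
        task = self._find_queued(task_id)
        if task is None:
            return False
        task.state = "cancelled"
        return True

    def next_task(self) -> Optional[Task]:
        """Dispatch the next runnable task, honouring pause and throttle."""
        if self._paused:
            return None
        while self._tasks:
            task = self._tasks.popleft()
            if task.state == "cancelled":
                continue
            if self._delay_s > 0:
                self._sleep(self._delay_s)
            task.state = "running"
            return task
        return None
```

The delay here is global; a per-resource throttle could keep one delay per `resource` key instead. Cancelled tasks are marked rather than removed so the live view can still show them until they reach the head of the queue.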

## Tie-ins
- #894 scheduler Phase 2 (the queue this visualizes + controls; admission/eviction).
- agent-audit/observability layer (the CONVERGED design with @taOSmd: OTLP-push, semconv v0, decision/governance context, PII redaction). This issue is the umbrella; detailed sub-scoping is done WITH @taOSmd per that design.
- #890/#892 (worker/cluster) -- the queue + traces span cluster nodes, not just the controller.

## Process
Umbrella + vision capture. Brainstorm -> spec with @taOSmd -> phased build (read-only live view first, then drill-down, then control surface). Do not block the storybook demo.
