observability for a psyche run is fragmented across many parts:
a TUI, OTLP tracing, specially structured tracing log messages for docker tests, a TCP metrics server, plain logging, and iroh metrics.
we've tried debugging the system through these various channels, and it's a bit of a nightmare.
we're overhauling the entire psyche metrics system to use event sourcing.
the problem
when something goes wrong in this distributed system, we have to answer questions like "node #5 says they sent their training results but node #3 never got them. why did this happen?" or "the witness quorum never formed. which witnesses even saw what data?"
trying to answer these questions right now means scanning through noisy logs from multiple systems and multiple hosts, correlating timestamps, and trying to reconstruct what happened from vague log lines and aggregate metrics.
the solution
record every event required to reconstruct a run as structured data: events with a timestamp and enough context to trace a piece of data through the entire system.
in theory, this gives us a global timeline for an entire run that we can scrub back and forth through to view the global state at any point.
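as a rough shape, every event shares a common envelope: a timestamp plus context tags. a minimal sketch in Rust, where `EventEnvelope` and its field names are hypothetical, not an existing psyche type:

```rust
use std::time::SystemTime;

/// the common envelope every event is wrapped in (hypothetical shape)
pub struct EventEnvelope<T> {
    /// when this node observed the event
    pub timestamp: SystemTime,
    /// which run this event belongs to
    pub run_id: String,
    /// which node emitted it
    pub node_id: String,
    /// the event-specific payload (see the specification below)
    pub payload: T,
}
```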
we're going to capture (sketched as an enum after this list):
- every coordinator state change
- every network operation (connection changes, blob transfers, gossip messages)
- every training action (training data downloads, start/complete training)
- model loading / checkpointing
- snapshots of system resources
- attempts to write data to the coordinator
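a hypothetical top-level grouping of those families as a Rust enum; the concrete variants are fleshed out in the specification below:

```rust
/// one variant per captured event family (names are illustrative)
pub enum Event {
    Coordinator(CoordinatorEvent),      // state changes + write attempts
    Network(NetworkEvent),              // connections, blob transfers, gossip
    Training(TrainingEvent),            // data downloads, start/complete training
    Checkpoint(CheckpointEvent),        // model loading / checkpointing
    ResourceSnapshot(ResourceSnapshot), // periodic system resource stats
}

// placeholder payload types; each is detailed in its own section below
pub struct CoordinatorEvent;
pub struct NetworkEvent;
pub struct TrainingEvent;
pub struct CheckpointEvent;
pub struct ResourceSnapshot;
```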
this design would let us switch to a database or something later for actual timeline analysis; the focus right now is on getting the events correct.
this ticket doesn't specify how we'll emit events - for now these could just be logs. the ticket focuses on what structured data we actually want.
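for example, the logs-only path could be one structured `tracing` line per event, with envelope fields and tags as key/value pairs so they stay machine-parseable. the function and field names here are illustrative, not a committed API:

```rust
use tracing::info;

/// emit a single event as a structured log line (illustrative)
fn emit_blob_upload_started(
    run_id: &str,
    node_id: &str,
    upload_id: u64, // unique tag for this upload
    data_tag: &str, // what piece of data it is
    to_endpoint_id: &str,
    size: u64,
    retry: u32,
) {
    info!(
        event = "blob_upload_started",
        run_id,
        node_id,
        upload_id,
        data_tag,
        to_endpoint_id,
        size,
        retry,
    );
}
```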
events are facts, not derived metrics; we don't need to emit anything that can be derived.
we only track our own actions and observations that only we can make. we treat the coordinator as a black box and only note that we saw it change (when, and what it changed to), so we don't have to spam the entire coordinator struct into the logs. (we can reconstruct its state completely from the solana tx history.)
specification
run context (emitted once at start)
run_started
- run_id
- node_id
- node_role (trainer/witness/etc)
- config_hash or key config params
- psyche_version
- start_timestamp
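a sketch of this as a struct; `NodeRole` is a hypothetical enum for the roles listed:

```rust
use std::time::SystemTime;

pub enum NodeRole {
    Trainer,
    Witness,
    // etc.
}

pub struct RunStarted {
    pub run_id: String,
    pub node_id: String,
    pub node_role: NodeRole,
    /// config_hash, or key config params inline
    pub config_hash: String,
    pub psyche_version: String,
    pub start_timestamp: SystemTime,
}
```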
each subsequent event will include a timestamp and context tags.
tags expected on the events below are denoted in parentheses; fields unique to an event are listed beneath it.
specific event types
coordinator
saw coordinator state change
submitted event (unique req tag)
- type: witness / health check / optimistic warmup etc
event submission complete (unique req tag)
- success/fail
- error_string if fail
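a sketch of the coordinator events; the shared `req_id` is the unique req tag that pairs a submission with its completion (all names hypothetical):

```rust
/// what kind of data we tried to write to the coordinator
pub enum SubmissionType {
    Witness,
    HealthCheck,
    OptimisticWarmup,
    // etc.
}

pub enum CoordinatorEvent {
    /// we saw the coordinator change; we record only when and what it
    /// changed to (full state is reconstructable from solana tx history)
    SawStateChange { new_state: String },
    SubmittedEvent { req_id: u64, r#type: SubmissionType },
    EventSubmissionComplete { req_id: u64, success: bool, error_string: Option<String> },
}
```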
network and p2p events
p2p connection changed
- endpoint_id
- connection_path (direct vs relayed vs disconnected)
gossip neighbors changed
p2p peer latency changed
blob added to store (tag for what piece of data (batch # / p2p param / whatever) it is)
blob upload started (unique tag for this upload, tag for what piece of data it is)
- to_endpoint_id
- size
- retry #
blob upload progress (unique tag) - only emit if >5s since last progress event
blob upload completed (unique tag)
- success/fail
- error_string if fail
blob download started (unique tag for this download, tag for what piece of data it is, tag for its hash)
- from_endpoint_id
- size
- retry #?
blob download progress (unique tag) - only emit if >5s since last progress event
blob download completed (unique tag)
- success/fail
- error_string if fail
gossip message sent (specific data tag, batch id?, blob hash?)
- message_type
- message_size
- # rebroadcast
new gossip message recv'd (same tags)
- message_type
- message_size
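a sketch of the blob upload events plus the ">5s since last progress" throttle. `TransferId` and `DataTag` are hypothetical correlation tags (which transfer this is, and what piece of data it carries), `bytes_done` is an assumed progress field, and download/gossip events follow the same pattern:

```rust
use std::time::{Duration, Instant};

pub type TransferId = u64;
pub type DataTag = String;

pub enum BlobEvent {
    UploadStarted {
        id: TransferId,
        data: DataTag,
        to_endpoint_id: String,
        size: u64,
        retry: u32,
    },
    UploadProgress {
        id: TransferId,
        bytes_done: u64,
    },
    UploadCompleted {
        id: TransferId,
        success: bool,
        error_string: Option<String>,
    },
    // download events mirror these, with from_endpoint_id and the blob hash
}

/// only emit a progress event if >5s have passed since the last one
pub struct ProgressThrottle {
    last_emit: Option<Instant>,
    min_interval: Duration,
}

impl ProgressThrottle {
    pub fn new() -> Self {
        Self {
            last_emit: None,
            min_interval: Duration::from_secs(5),
        }
    }

    /// returns true (and records the time) if enough time has passed
    pub fn should_emit(&mut self) -> bool {
        let now = Instant::now();
        match self.last_emit {
            Some(last) if now.duration_since(last) < self.min_interval => false,
            _ => {
                self.last_emit = Some(now);
                true
            }
        }
    }
}
```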
training events
batch assignment received (multiple batch ids)
batch data download start (batch id)
batch data download progress (batch id) - only emit if >5s since last progress event
batch data download complete (batch id)
- success/fail
- error_string if fail
training started (batch id)
training finished (batch id)
distro result deserialize started (batch id)
distro result deserialize completed (batch id)
- success/fail
- error_string if fail
applying distro results (list of batch ids)
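a sketch of these as an enum; the `BatchId` tag is what lets us follow one batch from assignment through download, training, and distro-result handling:

```rust
pub type BatchId = u64;

pub enum TrainingEvent {
    BatchAssignmentReceived { batch_ids: Vec<BatchId> },
    BatchDataDownloadStart { batch_id: BatchId },
    BatchDataDownloadProgress { batch_id: BatchId }, // throttled to >5s apart
    BatchDataDownloadComplete { batch_id: BatchId, success: bool, error_string: Option<String> },
    TrainingStarted { batch_id: BatchId },
    TrainingFinished { batch_id: BatchId },
    DistroResultDeserializeStarted { batch_id: BatchId },
    DistroResultDeserializeCompleted { batch_id: BatchId, success: bool, error_string: Option<String> },
    ApplyingDistroResults { batch_ids: Vec<BatchId> },
}
```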
warmup
p2p param requested from peer
- success/fail
- error_string if fail
checkpoint download started (tag)
checkpoint download progress (tag) - only emit if >5s since last progress event
checkpoint download complete (tag)
- success/fail
- error_string if fail
cooldown
model serialization started
model serialization finished
- success/fail
- error_string if fail
checkpoint write started
checkpoint write finished
- success/fail
- error_string if fail
checkpoint upload started
checkpoint upload progress - only emit if >5s since last progress event
checkpoint upload finished
- success/fail
- error_string if fail
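a sketch tying together the warmup and cooldown checkpoint events above; `CheckpointTag` stands in for the (tag) in those events, and all names are hypothetical:

```rust
pub type CheckpointTag = String;

pub enum CheckpointEvent {
    // warmup: getting a model onto this node
    P2pParamRequested { peer: String, success: bool, error_string: Option<String> },
    DownloadStarted { tag: CheckpointTag },
    DownloadProgress { tag: CheckpointTag }, // throttled to >5s apart
    DownloadComplete { tag: CheckpointTag, success: bool, error_string: Option<String> },
    // cooldown: getting the model off this node
    ModelSerializationStarted,
    ModelSerializationFinished { success: bool, error_string: Option<String> },
    WriteStarted,
    WriteFinished { success: bool, error_string: Option<String> },
    UploadStarted,
    UploadProgress, // throttled to >5s apart
    UploadFinished { success: bool, error_string: Option<String> },
}
```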
resource snapshots
emitted on a fixed interval (30s?)
resource snapshot
- gpu_mem_used
- gpu_utilization
- cpu_mem_used
- cpu_utilization
- network_bytes_sent_total
- network_bytes_recv_total
- disk_space_available
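a sketch of the snapshot struct and its fixed-interval emitter; units and the 30s default are still open questions from above, and `gather_snapshot` is a stub:

```rust
use std::time::Duration;

pub struct ResourceSnapshot {
    pub gpu_mem_used: u64,
    pub gpu_utilization: f32,
    pub cpu_mem_used: u64,
    pub cpu_utilization: f32,
    pub network_bytes_sent_total: u64,
    pub network_bytes_recv_total: u64,
    pub disk_space_available: u64,
}

/// stub: a real implementation would query the GPU / OS for these values
fn gather_snapshot() -> ResourceSnapshot {
    ResourceSnapshot {
        gpu_mem_used: 0,
        gpu_utilization: 0.0,
        cpu_mem_used: 0,
        cpu_utilization: 0.0,
        network_bytes_sent_total: 0,
        network_bytes_recv_total: 0,
        disk_space_available: 0,
    }
}

/// emit one snapshot event on a fixed interval
fn snapshot_loop(emit: impl Fn(ResourceSnapshot)) {
    loop {
        emit(gather_snapshot());
        std::thread::sleep(Duration::from_secs(30));
    }
}
```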