
VCR WIP [do not merge] #297

Draft
MOZGIII wants to merge 11 commits into main from mzg/2026-03-26/vcr

Conversation

@MOZGIII
Collaborator

@MOZGIII MOZGIII commented Mar 30, 2026

This PR makes visible the work I've been doing on replay support on the workers side.
Development here is on hold while we investigate the prod issues.


The basic idea of the implementation is as follows (and some notes):

  1. add two new big subsystems: vcr recorder and vcr playback; the recorder captures workloads off of waymark-start-workers, and playback injects the workloads into waymark-start-workers, replacing the real worker pool but using the real postgres backend (it just queues instances from the pre-recorded file);
  2. use traits and proper abstraction layers rather than adding the code directly into the runloop or the postgres backend; this is for architecture hygiene reasons, but also because the actual goal is to be able to test the underlying code behind those abstraction layers - so that code should stay as-is, now and in perpetuity, for this feature to make sense; if we cut corners and add special-casing for it at the backend layer, it would defeat the purpose of this work;
  3. there are some supporting crates - like waymark-vcr-file, which provides the shared implementation of reading and writing vcr files, as well as the vcr file format;
  4. the worker pool and backend trait implementations are provided in crates separate from waymark-vcr-recorder and waymark-vcr-playback, in order to keep the integration layer with other major subsystems explicitly lightweight, with a small and limited scope that is easy to test;
  5. during recording we try to keep track of instances and their actions - we want to group the actions by instance so that on replay we can just loop through the instances and load the actions for each corresponding instance locally, thus avoiding big jumps; there is an unresolved issue with capturing the workflow versions and dags - we don't want to keep a copy of the dag for each of the instances;
  6. another issue is correlating the recorded actions (executions) with the replay executions; we want to be able to match the actions we have recorded with the actions for each executor, and the easiest way of doing this would be to correlate by execution ID - but those are generated on-the-fly when the node is added for execution, so this is unsolved for now and might require some more work to make those IDs computed deterministically (i.e. from an instance id, graph node id, iteration of the loop unwinding, and an attempt number) and to add more type separation to distinguish between the node id and the execution id;
  7. a general refactor of the QueuedInstance type and the associated backend/executor operations would simplify building this feature, but it is not a blocker.
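
To make item 5 more concrete, here is a rough sketch of how recorded actions could be grouped by instance while storing each DAG only once per workflow version. All the names here (`Recording`, `InstanceRecord`, `RecordedAction`, `WorkflowVersionId`, the `String` stand-in for a DAG) are hypothetical and only for illustration; they are not the types used in this branch, and this is not the actual vcr file format.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical identifier types; the real waymark types may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct InstanceId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct WorkflowVersionId(pub u64);

/// Hypothetical recorded action; `payload` stands in for whatever is captured.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RecordedAction {
    pub node: u32,
    pub payload: String,
}

/// Per-instance record: a reference to the workflow version plus its actions.
#[derive(Debug)]
pub struct InstanceRecord {
    pub version: WorkflowVersionId,
    pub actions: Vec<RecordedAction>,
}

#[derive(Debug, Default)]
pub struct Recording {
    /// One DAG copy per workflow version, shared by all instances of it.
    pub dags: HashMap<WorkflowVersionId, String>,
    /// Actions grouped by instance; a BTreeMap gives a stable iteration order.
    pub instances: BTreeMap<InstanceId, InstanceRecord>,
}

impl Recording {
    /// Registers an instance; the DAG is stored only the first time
    /// its workflow version is seen.
    pub fn record_instance(&mut self, id: InstanceId, version: WorkflowVersionId, dag: &str) {
        self.dags.entry(version).or_insert_with(|| dag.to_owned());
        self.instances.entry(id).or_insert_with(|| InstanceRecord {
            version,
            actions: Vec::new(),
        });
    }

    /// Appends an action to its instance group, preserving arrival order.
    /// Returns false if the instance was never registered.
    pub fn record_action(&mut self, id: InstanceId, action: RecordedAction) -> bool {
        match self.instances.get_mut(&id) {
            Some(record) => {
                record.actions.push(action);
                true
            }
            None => false,
        }
    }
}
```

With a layout along these lines, playback could iterate `instances` in order and read each instance's actions contiguously, looking the DAG up by version instead of carrying a copy per instance.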

UPD: added numbering so it's easier to reference items in discussion.
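
For item 6, one possible shape of a deterministic execution ID is sketched below. It is a structured ID built from exactly the inputs listed there (instance id, graph node id, loop-unwinding iteration, attempt number), with a dedicated type so it cannot be confused with a node id. The type and field names are hypothetical and this is only an illustration of the idea, not what the branch implements.

```rust
use std::fmt;

/// Hypothetical newtypes; the real waymark identifiers may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct InstanceId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NodeId(pub u32);

/// Execution ID derived purely from its inputs, so a recording run and a
/// replay run produce the same value for the same logical execution.
/// It is a distinct type from `NodeId`, so the two cannot be mixed up.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ExecutionId {
    pub instance: InstanceId,
    pub node: NodeId,
    /// Iteration of the loop unwinding that scheduled this node.
    pub iteration: u32,
    /// Attempt number (assumed here to increase by one per retry).
    pub attempt: u32,
}

impl ExecutionId {
    pub fn new(instance: InstanceId, node: NodeId, iteration: u32, attempt: u32) -> Self {
        Self { instance, node, iteration, attempt }
    }

    /// ID of the next retry of the same logical execution.
    pub fn next_attempt(self) -> Self {
        Self { attempt: self.attempt + 1, ..self }
    }
}

impl fmt::Display for ExecutionId {
    /// Renders as `instance:node:iteration:attempt`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}:{}:{}:{}",
            self.instance.0, self.node.0, self.iteration, self.attempt
        )
    }
}
```

Keeping the components as fields rather than hashing them avoids depending on a hash function that stays stable between the recording and replay builds; whether that fits the existing ID representation is an open question.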

@MOZGIII MOZGIII force-pushed the mzg/2026-03-26/vcr branch 7 times, most recently from 6e2bde6 to 97c110d Compare April 10, 2026 11:42
@MOZGIII MOZGIII changed the title from "Replay support in workers WIP [do not merge]" to "VCR WIP [do not merge]" Apr 10, 2026
@MOZGIII MOZGIII force-pushed the mzg/2026-03-26/vcr branch 5 times, most recently from 02b709a to 53f4216 Compare April 13, 2026 06:36
@github-actions

Coverage Report

Python Coverage

| Metric | Coverage |
| --- | --- |
| Lines | 72.0% |
| Branches | 58.0% |


Rust Coverage

| Metric | Coverage |
| --- | --- |
| Lines | 64.5% 🔴 (-1.6%) |
| Branches | N/A |


Compared to main branch

@MOZGIII MOZGIII force-pushed the mzg/2026-03-26/vcr branch from b3e9ce9 to 14aec6e Compare April 13, 2026 08:32
