Skip to content

Optimize critical path update, v15 schema for new (extended) index.#1857

Open
Will Dietz (dtzSiFive) wants to merge 96 commits into
masterfrom
fix/revtop-index-schema-15-chunked
Open

Optimize critical path update, v15 schema for new (extended) index.#1857
Will Dietz (dtzSiFive) wants to merge 96 commits into
masterfrom
fix/revtop-index-schema-15-chunked

Conversation

@dtzSiFive
Copy link
Copy Markdown
Contributor

Chunk the update, as even optimized (~60s -> ~12s) that's still much too long for a single transaction.

Otherwise other wake runs cannot make forward progress for the duration.

Replaces #1850 ; this has a single index that works for various filetree queries that today rely on filesearch, while also being usable for this work.

Use per-run lock files (.wake/locks/run_N.lock) to detect dead wake
processes.  Each wake process creates and holds an exclusive fcntl
lock on its lock file for the duration of the run.  On startup, probe
lock files of incomplete runs -- if we can acquire the lock, the
original process is dead and we mark it as reaped (end_time == -1).

Additionally:
If for whatever reason lock files are missing, to avoid blocking
GC watermark indefinitely assumes such runs are dead if they
started more than 24 hours ago.
This is mostly to cover a strange case that shouldn't happen,
not a part of anyone's normal flow.
use_id only tracked the most recent run using a job, which
breaks when multiple wakes run concurrently - each would
overwrite the other's use_id, corrupting GC decisions.

run_jobs is a junction table (run_id, job_id) allowing many
runs to reference the same job. GC uses a watermark approach:
  watermark = min(run_id where end_time is null) - 1
Jobs whose newest run_id <= watermark are safe to delete,
meaning all runs using them have completed.

Schema changes (migration 9 to 10):
- Remove use_id column from jobs table
- Add run_jobs(run_id, job_id) with CASCADE delete
- Add end_time column to runs (NULL while running)
- Mark existing runs complete so GC works immediately

Query changes:
- Insert into run_jobs on job create and reuse
- Overlap detection joins run_jobs for current run only
- GC queries use watermark instead of use_id filter
- --last, --last-used, --last-executed, digraph filter to
  last completed run (end_time IS NOT NULL)
- Critical path (revtop_order) binds this run's run_id

Adds finish_run() to set end_time before GC runs.
We don't use this information but it might be of value for
debugging or external integration purposes.
Allows us to drop index later if we want.

Per reviewer feedback, thanks!
Don't check error message presently
due to variations across environments.
Second parameter is used first now.

Per reviewer suggestion, thanks!
The schema had 'update entropy set seed=0 where 0;' which was used
to acquire an exclusive lock on database open, preventing any concurrent
access.

This is no longer needed with proper transaction handling.
Refuse to clean while builds are in progress.

First reap dead runs (refactor), then with RW lock held:
* Check if incomplete jobs, exit
* Delete files
The file-based build lock prevented concurrent wake invocations.
This is no longer needed, remove it!
database: add lock files for dead run detection
* schema: 12: -stale, no longer unique on path
* Send in all path info to job prims
* Find files using richer information.
* Consider multiple candidates in reuse_job,
  since find_prior doesn't check specifics of visible.
* Update add_hash for multi-wake
  Don't remove records, but do support new mtime.
* Update detect_overlap, delete_overlap.
  Keep these for time being.
multi-wake: remove .wake-build-lock mechanism
Display whether each run is running, crashed, or duration.
Adds end_time to RunReflection.
Migrated runs show 0s since we lack their actual end time.
Also check that making lock files read-only doesn't prevent
wake --history from determining liveness.
Targeted fix for re2's StringPiece differences across versions.
Record starttime eagerly via start_job() at fork time rather
than writing it in finish_job, so queued jobs (starttime==0) are
distinguishable from running ones.

Also allows showing how long jobs have actually been running.

Add --active, --queued, --in-flight capture filters,
and fix --canceled (now only jobs not part of live runs).

These all are filters on unfinished jobs.
Support RO run lock probing, use in multi-wake's --history.
Don't report jobs that failed but finished.

These coud be produced via: job_fail_launch -> finish_job .
Lists jobs grouped by run_id, showing elapsed time for running jobs
and [queued] for jobs not yet forked.
wake: record starttime early, add capture filters; fix --canceled
Prune unreachable file records, they have no use.

Runs after clearing run_files for this run and delete_jobs.
Ignore errors encountered, instead soldier on!

Errors shouldn't happen, but don't bail if there's a stray file
from a user's editor or the like.
Return usual exit code even if cas GC encounters error.
Drop open_run_jobs from database, unused.
Per reviewer feedback, thanks!
…oncat.

For most use cases we don't need the tags concatenated and grouped
by job_id, which is very slow.

Before, filters such as used by `--ps` were taking over 0.5s!
Now it completes too fast for `\time -v`.
wake: add --ps command to show currently running jobs
Per reviewer suggestion, thanks!
…-and-enumerate_blobs

enable cas unit tests, refactor cas to use constants
The job overhead for wake-stage's trivial amount of work amplifies costs
especially of startup latency when sourcing many files.

Additionally, the point of staging is to get the required information
ASAP so as to not race with concurrent modifications.

Putting into job queue is more overhead than just doing the reflink
directly, and as a primitive we ensure the staging is done immediately
not stuck behind a queue of hashing jobs.

This was especially noticeable with concurrent runs (multi-wake).

Issue diagnosed with help of `wake --ps`! :)
Chunk the update, as even optimized (~60s -> ~12s) that's still
much too long for a single transaction.
@dtzSiFive
Copy link
Copy Markdown
Contributor Author

This chunking, and that added in #1852 , should probably be moved to adaptively chunk based on a time budget instead of magic numbers. Magic time budget instead.

Regardless, these "stop the bleeding" and appear to be sufficient in my testing.

Base automatically changed from feature/stage-as-prim to master May 26, 2026 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants