Skip to content

Refresh on re-run for completed pull sub-commands#260

Closed
epeicher wants to merge 4 commits into
trunkfrom
delta-resync-refresh
Closed

Refresh on re-run for completed pull sub-commands#260
epeicher wants to merge 4 commits into
trunkfrom
delta-resync-refresh

Conversation

@epeicher

Copy link
Copy Markdown
Collaborator

Problem

Re-pulling a site never propagated edits or deletions to existing database rows — the local database stayed frozen at the content captured by the first completed pull. The root cause is an asymmetry in the engine: the composite pull orchestrator resets state for a delta re-pull (Pull::prepare_repull()), but callers that drive the sub-commands directly (notably Studio's pull-reprint) had no refresh path:

Sub-command Re-run on completed state (before)
files-pull Silent no-op ("already complete… use --abort")
db-pull Throws "already completed … Use --abort flag to start over."
db-apply Throws "already completed. Use --abort flag to re-run."

--abort is destructive (re-downloads everything), so there was no clean incremental re-sync.

Change

Re-invoking a sub-command on a completed state now performs a refresh, matching the composite pull's re-pull semantics:

  • files-pull → delta sync. Keeps the local index, re-indexes the remote, diffs, fetches only changes. Stale remote index / download lists are cleared first (download_remote_index() appends when the file exists).
  • db-pull → full refresh. New ImportClient::reset_db_pull_state() (now also used by Pull::prepare_repull()) resets DB state and deletes db.sql, .import-domains.json, and the .sql-buffer crash-recovery file, then re-downloads the entire dump.
  • db-apply → re-apply from statement 0. The dump is idempotent (DROP TABLE IF EXISTS + full INSERTs), so edits, inserts, and deletes all land.

In-progress/partial states still resume from their cursors; --abort still clears state without running.

Bugs fixed along the way

  1. --filter=skipped-earlier never marked completion — the branch returned before the status = "complete" marking, leaving every successful essential+skipped pull stuck in_progress forever. Fixed, plus a legacy-state repair in run() for sessions written by older versions.
  2. Deleting the skipped download list broke running runtimesapply-runtime mounts state/.import-download-list-skipped.jsonl into the generated runtime (remote-uploads proxy), and PHP-WASM re-resolves mounts on every runtime rotation. Deleting the file crashed server starts and WP-CLI with ENOENT. It is now truncated to empty instead (all has-skipped checks test size > 0, so semantics are unchanged).
  3. Phantom "X is currently editing" badges — the dump carries the remote's ephemeral _edit_lock postmeta, which renders as an editor lock on the imported copy for 150s after the captured timestamp. db-apply now strips _edit_lock rows after applying (WordPress regenerates them locally; core's WXR exporter skips them too).
  4. Stale .sql-buffer replay — a leftover crash-recovery buffer could be replayed into the target on a refresh (latent prepare_repull() gap, fixed via the shared reset helper).

Known limitation

A table dropped on the remote between pulls lingers in the local target (the new dump no longer mentions it). WP core tables always exist, so this is rare; recreating the SQLite file on refresh is a possible hardening follow-up.

Testing

  • New tests/Import/DbResyncStateTest.php: round-trip via the sub-command path proving an edited, inserted, and deleted row all propagate on re-run; db-pull refresh resets state and deletes stale artifacts; the stuck skipped-earlier state is repaired; _edit_lock is stripped while other postmeta survives.
  • tests/Import/FilesSyncStateTest.php updated: completed files-pull re-runs as a delta and clears the stale remote index.
  • Full tests/Import/ + tests/UrlRewriting/ suites green (493 tests).
  • Verified live end-to-end against a WP.com Atomic site via Studio: remote post edits, inserts, and deletes propagate on re-pull; site start and WP-CLI work after sync cycles.

Supersedes #257 — moved from a fork to an in-repo branch so CI runs with full token permissions (the Performance Report perf-comment step can post on this PR).

epeicher added 3 commits June 10, 2026 20:06
Re-invoking files-pull, db-pull, or db-apply on a completed state now
performs a refresh instead of erroring or no-oping, matching the
composite pull orchestrator's re-pull behavior:

- files-pull: runs a delta sync — keeps the local index, re-indexes the
  remote, diffs, and fetches only changes. The stale remote index and
  download lists are cleared first (download_remote_index() appends
  when the file exists).
- db-pull: full refresh via the new ImportClient::reset_db_pull_state()
  (also reused by Pull::prepare_repull()) — resets DB state, deletes
  db.sql, .import-domains.json, and the .sql-buffer crash-recovery
  file, then re-downloads the entire dump.
- db-apply: re-applies the dump from statement 0. The dump is
  idempotent (DROP TABLE IF EXISTS + full INSERTs), so remote edits,
  inserts, and deletes all propagate.

This unblocks callers that drive the sub-commands directly (notably
Studio's pull-reprint) and could previously never re-sync a database:
edits and deletions to existing rows stayed frozen at the first
completed pull.

Also fixed along the way:

- --filter=skipped-earlier runs never marked status=complete, leaving
  the state stuck in_progress after a successful pull. The branch now
  marks completion, and a legacy-state repair in run() heals sessions
  written by older versions.
- The skipped download list is now truncated instead of deleted
  everywhere. apply-runtime mounts that file into the generated
  runtime for the remote-uploads proxy, and PHP-WASM re-resolves
  mounts on every runtime rotation — deleting the file crashed server
  starts and WP-CLI with ENOENT.
- db-apply strips _edit_lock postmeta after applying. The dump carries
  the remote's ephemeral editor locks, which render as phantom
  "X is currently editing" badges on the imported copy.
- A stale .sql-buffer could be replayed into the target on refresh
  (latent prepare_repull bug, fixed by the shared reset helper).

In-progress/partial states still resume from their cursors; --abort
remains the way to clear state without running.
Three E2E tests asserted the old "re-run refuses / deletes" contract
that this branch changes; update them to verify the new behavior:

- import-02: re-running db-sync on a completed state now performs a
  full refresh (exit 0, fresh db.sql, audit log records the refresh)
  instead of erroring with a "--abort" message.
- import-40: the skipped download list is truncated to empty rather
  than deleted (apply-runtime mounts it; deletion crashed the runtime
  with ENOENT). Assert it exists and is zero-length.
- import-01: a completed files-sync re-run now performs a delta sync;
  rename the test and assert the completed status it ends on (it was
  passing only incidentally under the old "refuses" name).

Verified in the Dockerized E2E environment: all three files pass
(31 tests). The full suite's only remaining failures (import-07,
import-19) are pre-existing local-Docker permission/fault-injection
flakiness — they fail identically on trunk and pass on CI.
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Pull pipeline performance — large-directory

Site: large-directory · 2,000+ plus targeted file-transfer scenarios files · 10,000 posts · 25,000 postmeta · PHP 8.5.7

Stage PR trunk Δ Status Details
playground-sqlite-db-pull 9.38 s 9.84 s ⚪ -455 ms (-4.6%) condition=db-pull in PHP.wasm
runtime=php.wasm 8.3
wp_mysql_parser=enabled
mode=lexer
native_lexer=verified
native_token_stream=WP_MySQL_Native_Token_Stream
native_token_count=18
native_parser=selected
trunk: condition=db-pull in PHP.wasm
runtime=php.wasm 8.3
wp_mysql_parser=enabled
mode=lexer
native_lexer=verified
native_token_stream=WP_MySQL_Native_Token_Stream
native_token_count=18
native_parser=selected
playground-sqlite-db-apply 3.53 s 3.51 s ⚪ +20 ms (+0.6%) condition=db-apply to SQLite in PHP.wasm
runtime=php.wasm 8.3
wp_mysql_parser=enabled
mode=parser
native_lexer=verified
native_token_stream=WP_MySQL_Native_Token_Stream
native_token_count=18
native_parser=verified
native_ast=WP_MySQL_Native_Parser_Node
sqlite_driver_parser=verified
trunk: condition=db-apply to SQLite in PHP.wasm
runtime=php.wasm 8.3
wp_mysql_parser=enabled
mode=parser
native_lexer=verified
native_token_stream=WP_MySQL_Native_Token_Stream
native_token_count=18
native_parser=verified
native_ast=WP_MySQL_Native_Parser_Node
sqlite_driver_parser=verified
Total 12.92 s 13.35 s ⚪ -435 ms (-3.3%)

Numbers carry runner noise; treat single-run deltas as directional, not authoritative.

📈 Trunk performance history — commit-by-commit timeline.


$this->state = $this->load_state();

// Legacy state repair: older versions never marked a

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to document what older versions did inline. It only matters for reviewing this PR but will not matter once we merge it. Let's just document what this does now.

!file_exists($this->skipped_download_list_file)
) {
$this->audit_log(
"STATE REPAIR | skipped-earlier run finished but was never marked complete — marking complete",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we mark it as complete right before deleting the skipped_download_list_file? As this stands now, a hypothetical premature deletion of the skipped_download_list_file would keep reprint running where I'd expect it to stop and communicate the error.

Comment on lines +10269 to +10273
@@ -10188,6 +10270,68 @@ function ($ch, $dl_total, $dl_now, $ul_total, $ul_now) {
* Reset state to defaults while preserving cross-command data like
* preflight results, version, and follow_symlinks.
*/
/**

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the "Reset state" comment got misplaced now. There's also a misplaced comment before it already, let's find its home or delete it.

@unlink($path);
}
}
// Truncated, not deleted: the generated runtime may hold a mount

@adamziel adamziel Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does that mean? Is it for running in Studio? If so, let's clearly explain that scenario in the doc string. What's the setup? What's the problem? How is this solving the problem?

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the sub-command re-run behavior so that invoking files-pull, db-pull, or db-apply on a completed state now performs a refresh (delta sync / full re-dump / re-apply) instead of erroring with "use --abort." This aligns sub-command behavior with the composite pull orchestrator's re-pull semantics and fixes several related bugs along the way.

Changes:

  • Sub-commands (files-pull, db-pull, db-apply) now detect a completed state and re-run as a refresh (delta or full), with shared helpers (reset_db_pull_state(), clear_skipped_download_list()) extracted to avoid duplication.
  • Fixes for the --filter=skipped-earlier completion bug (never marked complete), truncation-vs-deletion for the skipped download list (prevents ENOENT in running runtimes), phantom _edit_lock badges (stripped after apply), and stale .sql-buffer replay.
  • Comprehensive test coverage added: new DbResyncStateTest.php with round-trip, edit-lock stripping, db-pull reset, and legacy state repair tests; updated FilesSyncStateTest.php and e2e tests to validate the new refresh behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
packages/reprint-importer/src/import.php Core implementation: legacy state repair, delta/refresh paths for all three sub-commands, clear_skipped_download_list() and reset_db_pull_state() helpers, strip_ephemeral_edit_locks(), and CLI help updates
packages/reprint-importer/src/lib/pull/class-pull.php Refactored prepare_repull() to use the shared reset_db_pull_state() helper
tests/Import/DbResyncStateTest.php New test class covering db-apply re-run, edit-lock stripping, db-pull reset, and legacy state repair
tests/Import/FilesSyncStateTest.php Updated test for completed files-pull re-run (delta instead of refusal)
tests/e2e/tests/import-01-basic-file-sync.test.js Updated e2e test: re-run now expects successful delta sync
tests/e2e/tests/import-02-sql-sync.test.js Updated e2e test: re-run expects successful refresh
tests/e2e/tests/import-40-defer-uploads.test.js Updated e2e test: skipped list validated as truncated (not deleted)
CLAUDE.md Documents the new re-run/refresh semantics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +10273 to +10294
/**
* Empty the skipped-files download list by truncating it in place
* rather than deleting it.
*
* apply-runtime mounts this file into the generated runtime — the
* temporary remote-uploads proxy reads it to decide which paths are
* still remote-only. A running server re-resolves its mounts on every
* PHP runtime rotation (and on restart from a persisted config), so
* deleting the file breaks the mount with ENOENT. An empty file means
* "no remote-only files left", which is exactly what every caller
* wants — all has-skipped checks test for size > 0.
*/
public function clear_skipped_download_list(string $reason): void
{
if (!file_exists($this->skipped_download_list_file)) {
return;
}
file_put_contents($this->skipped_download_list_file, "");
$this->audit_log(
"FILE TRUNCATE | {$this->skipped_download_list_file} | {$reason}",
);
}
$this->save_state($this->state);
$this->run_files_sync_pipeline();

// Pipeline returns early with partial status if interrupted.

@adamziel adamziel Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we run into this? I'm a bit confused and the comment doesn't tell me what are we guarding against. What is an interruption in this comment?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I think I get it now. I'll brush up these comments.

// the fresh-start path below, where $is_delta picks up the local
// index and turns the run into re-index → diff → fetch changes.
$this->audit_log(
sprintf("files-pull already complete: %d files indexed%s", $index_size, $skipped_note),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is valuable audit log information, let's keep it

@adamziel

Copy link
Copy Markdown
Collaborator

With this PR, it's no longer safe to re-run completed commands files-pull, db-pull, db-apply as they will implicitly restart the work. They intentionally require an explicit restart intent via --abort.

I understand the problem you're solving here is refreshing the local database from the remote site? Wouldn't calling db-pull --abort already do that?

@epeicher

Copy link
Copy Markdown
Collaborator Author

With this PR, it's no longer safe to re-run completed commands files-pull, db-pull, db-apply as they will implicitly restart the work. They intentionally require an explicit restart intent via --abort.

I understand the problem you're solving here is refreshing the local database from the remote site? Wouldn't calling db-pull --abort already do that?

Agreed, good catch. You're right that db-pull --abort && db-pull already does the refresh today (same for db-apply, and files-pull --abort keeps the index so its re-run is a delta), so this was changing the default rather than adding a capability, and the explicit intent guard is worth keeping.

I'll back out the implicit restart and keep the refresh behind --abort. Do you think adding a --refresh flag that applies db-pull --abort && db-pull is more ergonomic?

@epeicher

Copy link
Copy Markdown
Collaborator Author

Additionally, I can create follow-up PRs for the following identified issues:

  1. Truncate the skipped list instead of deleting it (the ENOENT fix)
  2. Mark --filter=skipped-earlier runs complete
  3. Strip _edit_lock on db-apply
  4. Clear .sql-buffer in prepare_repull

Then the refresh PR (flag or --abort) lands last on top.

@epeicher

Copy link
Copy Markdown
Collaborator Author

Closing this in favor of the following PRs:

And the most important one is the changes in Studio:

A couple of bugs found during this PR are not real issues so they have been closed:

@epeicher epeicher closed this Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants