Sync optimization: do existing content check in first stage#4471
Conversation
|
Best viewed with "hide whitespace" in the PR view settings Measured benefit: (Using RHEL 9 BaseOS sync, on-demand) Without Patch
With Patch
That's a 50% decrease! |
8149d4e to
97c422a
Compare
|
@ggainey @pedro-psb I learned last week that the string-interning memory use optimization I made last week was "incomplete" and really only made a full impact on the first sync, because on subsequent syncs most of the same package metadata gets pulled from the database in the This PR fixes that by caching the existing content in the first stage with a restriction on which fields are materialized immediately We don't use any of those fields later in the pipeline, so it's fine. There is a second PR by @jobselko which is complementary to this one - it does the same thing but inside the The reasons to keep this PR also:
|
| "release", | ||
| "arch", | ||
| "size_package", | ||
| "location_href", |
There was a problem hiding this comment.
Needs to respect domain
There was a problem hiding this comment.
What do you mean with "need to respect domain"? It feels safe to assume content in a repository will belong to the domain of that repository, which is enforced early on in the process (in viewset probably)
There was a problem hiding this comment.
You're probably correct but we should just prefetch the domain anyways
There was a problem hiding this comment.
So do you wanna do it here before approval?
There was a problem hiding this comment.
Updated. I just switched to using .defer(). I realized we're probably using evr and other fields somewhere - easier to just exclude ones we know are problematic and expensive than try to figure out exactly what is needed.
pedro-psb
left a comment
There was a problem hiding this comment.
Do you have any data in memory usage?
| "release", | ||
| "arch", | ||
| "size_package", | ||
| "location_href", |
There was a problem hiding this comment.
What do you mean with "need to respect domain"? It feels safe to assume content in a repository will belong to the domain of that repository, which is enforced early on in the process (in viewset probably)
Dramatically reduced. Like 88% in some cases. But that's because the optimization I did 2 months ago for memory use only really worked for the initial sync, and had a muted impact on resyncs - which are the majority of syncs. This fixes that. Strictly speaking @jobselko's patch would also basically accomplish the same thing, but like I mentioned this has the benefit of being more easily backported (Satellite wants this now) and slightly more efficient (because we can avoid the whole model-creation and interning process for packages that were in the cache) Most other plugins probably could not justify a separate implementation though. |
Build a cache of existing packages from the previous repository version. In the event that the repository being created is substantially similar (very likely), this should reduce the amount of database queries required in the QueryExistingContent stage, which can discriminate between Content objects yet-to-be-saved and ones which were already saved. Assisted-By: claude-opus-4.6
|
Weird, the sign on upload test failed on a duplicate Artifact, but on an artifact was generated by signing (which I though was unique each time): 2026-06-03T03:56:08.0869615Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO: Starting task id: 019e8b97-0164-7ff3-a4bc-3594b9ef6b10 in domain: default, task_type: pulp_rpm.app.tasks.signing.sign_and_create, immediate: False, deferred: True, worker: 1725@00e04d955322
2026-06-03T03:56:08.0872330Z pulp [fb7cde42fc744414a6071ae75c838581]: pulp_rpm.app.tasks.signing:INFO: Signing package /var/lib/pulp/tmp/1725@00e04d955322/tmp01atsu4h/tmplosfxjgm with fingerprint v4:4F0E406D006AFA08ABAD1CB16219E5C036FC5144.
2026-06-03T03:56:08.0876861Z DETAIL: Key (sha256, pulp_domain_id)=(d13e9957d3b9d909f52ea1b2926af5171abe6437df008c460a65ad1a0a89315c, 28cb5ff4-c27e-41f3-8b92-465b51a35afe) already exists.) will be sanitized in pulpcore 3.130
2026-06-03T03:56:08.0879463Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO: Task[pulp_rpm.app.tasks.signing.sign_and_create] 019e8b97-0164-7ff3-a4bc-3594b9ef6b10 failed (IntegrityError: duplicate key value violates unique constraint "core_artifact_sha256_pulp_domain_id_8efff3e8_uniq"
2026-06-03T03:56:08.0881943Z DETAIL: Key (sha256, pulp_domain_id)=(d13e9957d3b9d909f52ea1b2926af5171abe6437df008c460a65ad1a0a89315c, 28cb5ff4-c27e-41f3-8b92-465b51a35afe) already exists.) in domain: default
2026-06-03T03:56:08.0883959Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO: File "/usr/local/lib/python3.12/site-packages/pulpcore/tasking/tasks.py", line 81, in _execute_task
2026-06-03T03:56:08.0885107Z result = task_function()
2026-06-03T03:56:08.0885507Z ^^^^^^^^^^^^^^^
2026-06-03T03:56:08.0885746Z
2026-06-03T03:56:08.0886315Z File "/usr/local/lib/python3.12/site-packages/pulp_rpm/app/tasks/signing.py", line 184, in sign_and_create
2026-06-03T03:56:08.0887179Z artifact = _save_artifact(signed_package_path)
2026-06-03T03:56:08.0887691Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-03T03:56:08.0888003Z
2026-06-03T03:56:08.0888530Z File "/usr/local/lib/python3.12/site-packages/pulp_rpm/app/tasks/signing.py", line 86, in _save_artifact
2026-06-03T03:56:08.0889283Z artifact.save() |
pedro-psb
left a comment
There was a problem hiding this comment.
There was a transient failure on the other run, but most likely unrelated to this.
Backport to 3.27: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 207655f on top of patchback/backports/3.27/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 Backporting merged PR #4471 into main
🤖 @patchback |
Backport to 3.32: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 207655f on top of patchback/backports/3.32/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 Backporting merged PR #4471 into main
🤖 @patchback |
Backport to 3.29: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 207655f on top of patchback/backports/3.29/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 Backporting merged PR #4471 into main
🤖 @patchback |
Backport to 3.36: 💚 backport PR created✅ Backport PR branch: Backported as #4473 🤖 @patchback |
Backport to 3.35: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 207655f on top of patchback/backports/3.35/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 Backporting merged PR #4471 into main
🤖 @patchback |
Build a cache of existing packages from the previous repository version. In the event that the repository being created is substantially similar (very likely), this should reduce the amount of database queries required in the QueryExistingContent stage, which can discriminate between Content objects yet-to-be-saved and ones which were already saved.
Assisted-By: claude-opus-4.6
📜 Checklist
See: Pull Request Walkthrough