Skip to content

Sync optimization: do existing content check in first stage#4471

Merged
pedro-psb merged 1 commit into
pulp:mainfrom
dralley:sync-cache
Jun 3, 2026
Merged

Sync optimization: do existing content check in first stage#4471
pedro-psb merged 1 commit into
pulp:mainfrom
dralley:sync-cache

Conversation

@dralley
Copy link
Copy Markdown
Contributor

@dralley dralley commented May 28, 2026

Build a cache of existing packages from the previous repository version. In the event that the repository being created is substantially similar (very likely), this should reduce the amount of database queries required in the QueryExistingContent stage, which can discriminate between Content objects yet-to-be-saved and ones which were already saved.

Assisted-By: claude-opus-4.6

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

@dralley
Copy link
Copy Markdown
Contributor Author

dralley commented May 28, 2026

Best viewed with "hide whitespace" in the PR view settings

Measured benefit:

(Using RHEL 9 BaseOS sync, on-demand)

Without Patch

  • First sync: 113 seconds
  • Second sync: 103 seconds

With Patch

  • First sync: 109 seconds (just noise, no real change)
  • Second sync: 51 seconds

That's a 50% decrease!

@dralley dralley marked this pull request as ready for review June 1, 2026 18:54
@dralley dralley force-pushed the sync-cache branch 2 times, most recently from 8149d4e to 97c422a Compare June 1, 2026 19:18
@dralley
Copy link
Copy Markdown
Contributor Author

dralley commented Jun 1, 2026

@ggainey @pedro-psb I learned last week that the string-interning memory use optimization I made last week was "incomplete" and really only made a full impact on the first sync, because on subsequent syncs most of the same package metadata gets pulled from the database in the QueryExistingContents stage.

This PR fixes that by caching the existing content in the first stage with a restriction on which fields are materialized immediately .only("name", "epoch", "version", ...) and swapping them out early, thereby bypassing the QueryExistingContents phase since it doesn't check models that are already marked as being in the database.

We don't use any of those fields later in the pipeline, so it's fine.

There is a second PR by @jobselko which is complementary to this one - it does the same thing but inside the QueryExistingContents stage itself. pulp/pulpcore#7754

The reasons to keep this PR also:

  • We can backport it more easily, and it's a significant enough improvement to be worthy of backporting
  • In the case of RPM, doing the swap earlier means that there's a ton of strings that we can avoid materializing in Python entirely, rather than extracting them, doing all of the interning and then immediately throwing them away in the next stage

Comment thread pulp_rpm/app/tasks/synchronizing.py Outdated
"release",
"arch",
"size_package",
"location_href",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to respect domain

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "need to respect domain"? It feels safe to assume content in a repository will belong to the domain of that repository, which is enforced early on in the process (in viewset probably)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're probably correct but we should just prefetch the domain anyways

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you wanna do it here before approval?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I need to push

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I just switched to using .defer(). I realized we're probably using evr and other fields somewhere - easier to just exclude ones we know are problematic and expensive than try to figure out exactly what is needed.

Copy link
Copy Markdown
Member

@pedro-psb pedro-psb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any data in memory usage?

Comment thread pulp_rpm/app/tasks/synchronizing.py Outdated
"release",
"arch",
"size_package",
"location_href",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "need to respect domain"? It feels safe to assume content in a repository will belong to the domain of that repository, which is enforced early on in the process (in viewset probably)

@dralley
Copy link
Copy Markdown
Contributor Author

dralley commented Jun 2, 2026

Do you have any data in memory usage?

Dramatically reduced. Like 88% in some cases. But that's because the optimization I did 2 months ago for memory use only really worked for the initial sync, and had a muted impact on resyncs - which are the majority of syncs. This fixes that.

Strictly speaking @jobselko's patch would also basically accomplish the same thing, but like I mentioned this has the benefit of being more easily backported (Satellite wants this now) and slightly more efficient (because we can avoid the whole model-creation and interning process for packages that were in the cache)

Most other plugins probably could not justify a separate implementation though.

Build a cache of existing packages from the previous repository version.
In the event that the repository being created is substantially similar
(very likely), this should reduce the amount of database queries
required in the QueryExistingContent stage, which can discriminate
between Content objects yet-to-be-saved and ones which were already
saved.

Assisted-By: claude-opus-4.6
@pedro-psb
Copy link
Copy Markdown
Member

Weird, the sign on upload test failed on a duplicate Artifact, but on an artifact was generated by signing (which I though was unique each time):

2026-06-03T03:56:08.0869615Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO: Starting task id: 019e8b97-0164-7ff3-a4bc-3594b9ef6b10 in domain: default, task_type: pulp_rpm.app.tasks.signing.sign_and_create, immediate: False, deferred: True, worker: 1725@00e04d955322
2026-06-03T03:56:08.0872330Z pulp [fb7cde42fc744414a6071ae75c838581]: pulp_rpm.app.tasks.signing:INFO: Signing package /var/lib/pulp/tmp/1725@00e04d955322/tmp01atsu4h/tmplosfxjgm with fingerprint v4:4F0E406D006AFA08ABAD1CB16219E5C036FC5144.
2026-06-03T03:56:08.0876861Z DETAIL:  Key (sha256, pulp_domain_id)=(d13e9957d3b9d909f52ea1b2926af5171abe6437df008c460a65ad1a0a89315c, 28cb5ff4-c27e-41f3-8b92-465b51a35afe) already exists.) will be sanitized in pulpcore 3.130
2026-06-03T03:56:08.0879463Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO: Task[pulp_rpm.app.tasks.signing.sign_and_create] 019e8b97-0164-7ff3-a4bc-3594b9ef6b10 failed (IntegrityError: duplicate key value violates unique constraint "core_artifact_sha256_pulp_domain_id_8efff3e8_uniq"
2026-06-03T03:56:08.0881943Z DETAIL:  Key (sha256, pulp_domain_id)=(d13e9957d3b9d909f52ea1b2926af5171abe6437df008c460a65ad1a0a89315c, 28cb5ff4-c27e-41f3-8b92-465b51a35afe) already exists.) in domain: default
2026-06-03T03:56:08.0883959Z pulp [fb7cde42fc744414a6071ae75c838581]: pulpcore.tasking.tasks:INFO:   File "/usr/local/lib/python3.12/site-packages/pulpcore/tasking/tasks.py", line 81, in _execute_task
2026-06-03T03:56:08.0885107Z     result = task_function()
2026-06-03T03:56:08.0885507Z              ^^^^^^^^^^^^^^^
2026-06-03T03:56:08.0885746Z 
2026-06-03T03:56:08.0886315Z   File "/usr/local/lib/python3.12/site-packages/pulp_rpm/app/tasks/signing.py", line 184, in sign_and_create
2026-06-03T03:56:08.0887179Z     artifact = _save_artifact(signed_package_path)
2026-06-03T03:56:08.0887691Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-03T03:56:08.0888003Z 
2026-06-03T03:56:08.0888530Z   File "/usr/local/lib/python3.12/site-packages/pulp_rpm/app/tasks/signing.py", line 86, in _save_artifact
2026-06-03T03:56:08.0889283Z     artifact.save()

Copy link
Copy Markdown
Member

@pedro-psb pedro-psb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a transient failure on the other run, but most likely unrelated to this.

@pedro-psb pedro-psb merged commit 207655f into pulp:main Jun 3, 2026
24 of 27 checks passed
@dralley dralley deleted the sync-cache branch June 3, 2026 12:51
@patchback
Copy link
Copy Markdown

patchback Bot commented Jun 3, 2026

Backport to 3.27: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 207655f on top of patchback/backports/3.27/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471

Backporting merged PR #4471 into main

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/pulp/pulp_rpm.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.27/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 upstream/3.27
  4. Now, cherry-pick PR Sync optimization: do existing content check in first stage #4471 contents into that branch:
    $ git cherry-pick -x 207655ff537671b6ffc2d01e55c03527d4768b98
    If it'll yell at you with something like fatal: Commit 207655ff537671b6ffc2d01e55c03527d4768b98 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 207655ff537671b6ffc2d01e55c03527d4768b98
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Sync optimization: do existing content check in first stage #4471 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.27/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link
Copy Markdown

patchback Bot commented Jun 3, 2026

Backport to 3.32: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 207655f on top of patchback/backports/3.32/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471

Backporting merged PR #4471 into main

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/pulp/pulp_rpm.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.32/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 upstream/3.32
  4. Now, cherry-pick PR Sync optimization: do existing content check in first stage #4471 contents into that branch:
    $ git cherry-pick -x 207655ff537671b6ffc2d01e55c03527d4768b98
    If it'll yell at you with something like fatal: Commit 207655ff537671b6ffc2d01e55c03527d4768b98 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 207655ff537671b6ffc2d01e55c03527d4768b98
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Sync optimization: do existing content check in first stage #4471 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.32/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link
Copy Markdown

patchback Bot commented Jun 3, 2026

Backport to 3.29: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 207655f on top of patchback/backports/3.29/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471

Backporting merged PR #4471 into main

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/pulp/pulp_rpm.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.29/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 upstream/3.29
  4. Now, cherry-pick PR Sync optimization: do existing content check in first stage #4471 contents into that branch:
    $ git cherry-pick -x 207655ff537671b6ffc2d01e55c03527d4768b98
    If it'll yell at you with something like fatal: Commit 207655ff537671b6ffc2d01e55c03527d4768b98 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 207655ff537671b6ffc2d01e55c03527d4768b98
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Sync optimization: do existing content check in first stage #4471 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.29/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link
Copy Markdown

patchback Bot commented Jun 3, 2026

Backport to 3.36: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.36/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471

Backported as #4473

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link
Copy Markdown

patchback Bot commented Jun 3, 2026

Backport to 3.35: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 207655f on top of patchback/backports/3.35/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471

Backporting merged PR #4471 into main

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/pulp/pulp_rpm.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.35/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471 upstream/3.35
  4. Now, cherry-pick PR Sync optimization: do existing content check in first stage #4471 contents into that branch:
    $ git cherry-pick -x 207655ff537671b6ffc2d01e55c03527d4768b98
    If it'll yell at you with something like fatal: Commit 207655ff537671b6ffc2d01e55c03527d4768b98 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 207655ff537671b6ffc2d01e55c03527d4768b98
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Sync optimization: do existing content check in first stage #4471 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.35/207655ff537671b6ffc2d01e55c03527d4768b98/pr-4471
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants