Downstream implementation of parallel pulp container modules in python#73

Open
Alex-Welsh wants to merge 4 commits into master from full-parallel

Conversation

@Alex-Welsh
Member

Full disclosure: Gen AI was used to create the new modules.
I've spent a while testing the changes against an aio-like environment, and have included new test playbooks.

The code is complex but fairly readable. It's a downstream tool and not at all critical, so if we hit any issues we can easily revert to the old version.

Results for a full container sync/publish:
Old role: 2h 1m 31s
New modules: 10m 13s

The new implementation ran with a concurrency of 10, which could be pushed higher. The test instance was on SMS, using an aufn-like environment to create a seed that hosts Pulp, so there is a fair bit of virtualisation overhead.
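
As a quick sanity check on the timings above, the two durations can be converted to seconds and compared. This is only arithmetic on the figures quoted in this comment; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an h/m/s duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


old_role = to_seconds(hours=2, minutes=1, seconds=31)  # 7291 s
new_modules = to_seconds(minutes=10, seconds=13)       # 613 s

# Ratio of old to new wall-clock time for the full sync/publish run.
speedup = old_role / new_modules
print(f"speedup: {speedup:.1f}x")
```

That works out to a bit under 12x for this particular run.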

Depends on #71

@gemini-code-assist

Summary of Changes

Hello @Alex-Welsh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance and scalability of managing Pulp container content through Ansible. By introducing new parallelized modules and refactoring existing roles, it enables concurrent operations for repositories, remotes, distributions, and syncs, leading to substantial time savings in deployment and management workflows. The changes aim to provide a more robust and efficient experience for users interacting with Pulp container services.

Highlights

  • Performance Improvement: Implemented full parallelization for container sync and publish operations, drastically reducing execution time from over 2 hours to just over 10 minutes for a full container sync/publish, with potential for further optimization by increasing concurrency.
  • New Ansible Modules: Introduced new Ansible modules (container_repositories, container_remotes, container_distributions, container_syncs) to manage Pulp container entities in parallel, replacing previous single-entity modules and enhancing efficiency for batch operations.
  • Refactored Content Management: The pulp_container_content module has been removed and its functionality refactored into a new process_content.yml task file within the pulp_container_content role, leveraging generic API calls for more flexible and concurrent content unit management.
  • Concurrency Support: Added configurable concurrency options to the new container modules and relevant roles, allowing users to specify the maximum number of parallel operations for improved performance.
  • Updated Dependencies and Compatibility: The collection's galaxy.yml has been updated to version 0.6.0, including a new author and pinning the pulp.squeezer dependency to 0.2.3. The README.md now reflects compatibility with Ansible 11.
  • Enhanced Test Coverage: New test playbooks have been added for the container_distributions, container_remotes, container_repositories, and container_syncs modules to ensure the stability and correctness of the new parallel implementations.

Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/pull_request.yml

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant and impressive refactoring to parallelize Pulp container operations by introducing new concurrent Ansible modules. The performance gains described are substantial. The new modules for repositories, remotes, distributions, and syncs are well-structured, leveraging concurrent.futures for parallelism and pulp-glue for API interactions. The corresponding roles and tests have been effectively updated to use these new batch modules. My review includes a couple of suggestions for the new task-based implementation of container content management to improve correctness and simplify the code.
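
The review mentions `concurrent.futures`, but the module code itself is not shown in this thread. A minimal sketch of the general pattern, assuming a thread pool bounded by a `concurrency` setting: `run_batch` and `ensure_item` are hypothetical names, and `ensure_item` stands in for the real per-item Pulp API calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ensure_item(item: dict) -> dict:
    """Hypothetical stand-in for one per-item API operation."""
    if not item.get("name"):
        raise ValueError("item has no name")
    return {"name": item["name"], "changed": True}


def run_batch(items: list, concurrency: int = 10) -> dict:
    """Run ensure_item over all items with at most `concurrency` workers."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Map each future back to its input so failures can be reported per item.
        futures = {pool.submit(ensure_item, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record the failure and keep going
                failures.append({"item": item, "msg": str(exc)})
    return {
        "changed": any(r["changed"] for r in results),
        "failed": bool(failures),
        # Completion order is nondeterministic, so sort for stable output.
        "results": sorted(results, key=lambda r: r["name"]),
        "failures": failures,
    }


outcome = run_batch([{"name": "repo-a"}, {"name": "repo-b"}, {}], concurrency=2)
```

Collecting failures per item rather than raising on the first one lets a batch report everything that went wrong in a single run, which is one plausible design for a batch module; whether the PR does this is not stated here.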

Comment thread roles/pulp_container_content/tasks/process_content.yml
Comment thread roles/pulp_container_content/tasks/process_content.yml Outdated
Alex-Welsh force-pushed the full-parallel branch 16 times, most recently from 7d1881f to a0f69f2 (December 23, 2025 10:19)
Alex-Welsh changed the title from "Full parallel downstream implementation of pulp container modules" to "Downstream implementation of parallel pulp container modules in python" (Jan 2, 2026)
Alex-Welsh marked this pull request as ready for review (January 7, 2026 11:23)
Alex-Welsh requested a review from a team as a code owner (January 7, 2026 11:23)
@Alex-Welsh
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant performance improvement by implementing parallel processing for Pulp container operations. The new Ansible modules for repositories, remotes, distributions, and syncs are well-designed to handle batch operations concurrently, drastically reducing execution time as noted in the description. The code is generally clean and includes new tests for the added functionality. My review includes a few suggestions to improve maintainability, consistency, and security, such as refactoring duplicated code, improving parameter validation, and ensuring all sensitive data is sanitized. Overall, this is a great contribution that will significantly benefit users managing large numbers of container resources in Pulp.

Comment thread plugins/modules/container_remotes.py
Comment thread plugins/modules/container_distributions.py
Comment thread plugins/modules/container_distributions.py Outdated
Comment thread plugins/modules/container_repositories.py Outdated
Comment thread plugins/modules/container_syncs.py Outdated
Comment thread roles/pulp_repository/tasks/container.yml
@priteau
Member

priteau commented Feb 24, 2026

I am not keen on having a commit with GPL license and another with Apache 2, could you please squash?

@priteau
Member

priteau commented Feb 24, 2026

My own comparison of the old and new implementation on repo sync/publish (on an already synced Pulp):

  • pulp-repo-sync.yml: before 4m39.717s, after 1m25.681s
  • pulp-repo-publish.yml: before 0m31.651s, after 0m27.345s

@technowhizz

@Alex-Welsh Is this ready to be used?

@Alex-Welsh
Member Author

@Alex-Welsh Is this ready to be used?

Yeah, I've used it a lot over the last few months, it all seems to work now

@grzegorzkoper grzegorzkoper left a comment

Hey, great PR - would like it to get merged asap.
Since it was AI generated, I figured it should be AI reviewed.
My clankers found some minor issues, after that LGTM

Comment thread roles/pulp_repository/tasks/main.yml
Comment thread roles/pulp_repository/tasks/main.yml Outdated
Comment thread plugins/modules/container_syncs.py
Comment thread plugins/modules/container_repositories.py
Comment thread plugins/modules/container_remotes.py
Comment thread plugins/modules/container_distributions.py
grzegorzkoper previously approved these changes May 12, 2026

@grzegorzkoper grzegorzkoper left a comment

Thanks, LGTM

Comment thread plugins/modules/container_remotes.py Outdated
Comment thread plugins/modules/container_remotes.py Outdated
Comment thread plugins/modules/container_remotes.py
Comment thread plugins/modules/container_remotes.py
@bbezak
Member

bbezak commented May 12, 2026

#74

Mark remote/proxy usernames as no_log and strip them from returned
container remote data, matching the old pulp.squeezer remote module
behavior.

Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>
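
A minimal sketch of the idea in this commit message, not the actual change: credential-like keys are removed from a remote's data before it is returned. The key names follow the role parameters discussed in this thread and are assumptions about the returned data; `SENSITIVE_KEYS` and `sanitize_remote` are hypothetical. In an Ansible module, the matching argument spec entries would additionally be declared with `no_log=True`.

```python
# Assumed key names; the real module may use different ones.
SENSITIVE_KEYS = {
    "remote_username",
    "remote_password",
    "proxy_username",
    "proxy_password",
    "client_key",
}


def sanitize_remote(remote: dict) -> dict:
    """Return a copy of `remote` without credential-like keys."""
    return {k: v for k, v in remote.items() if k not in SENSITIVE_KEYS}


returned = sanitize_remote(
    {
        "name": "alpine-remote",
        "url": "https://registry.example.com",
        "remote_username": "svc",
        "remote_password": "secret",
    }
)
```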
@bbezak
Member

bbezak commented May 12, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces several new Ansible modules—container_distributions, container_remotes, container_repositories, and container_syncs—designed to perform batch operations concurrently on a Pulp server. The existing container tasks in the pulp_distribution and pulp_repository roles have been refactored to use these batch modules, and a new prereqs task ensures the required pulp-glue dependencies are installed. Feedback on the PR identifies critical missing existence checks for content guards, repositories, and remotes within the new modules to prevent potential runtime errors. Additionally, there is a recommendation to refactor the list-building logic in the Ansible tasks to avoid the performance overhead associated with looping set_fact.

Comment on lines +194 to +197
content_guard_ctx = PulpContentGuardContext(
    pulp_ctx, entity={"name": content_guard_name}
)
desired_attributes["content_guard"] = content_guard_ctx.pulp_href

high

The module should verify that the content guard exists if a name is provided. Accessing pulp_href on a context with a missing entity may lead to unexpected behavior or errors if the guard cannot be found in Pulp.

Suggested change
- content_guard_ctx = PulpContentGuardContext(
-     pulp_ctx, entity={"name": content_guard_name}
- )
- desired_attributes["content_guard"] = content_guard_ctx.pulp_href
+ content_guard_ctx = PulpContentGuardContext(
+     pulp_ctx, entity={"name": content_guard_name}
+ )
+ if not content_guard_ctx.entity:
+     result["failed"] = True
+     result["msg"] = f"Content guard '{content_guard_name}' not found."
+     return result
+ desired_attributes["content_guard"] = content_guard_ctx.pulp_href

repository_ctx = PulpContainerRepositoryContext(
    pulp_ctx, entity={"name": sync_item["repository"]}
)
repository = repository_ctx.entity

high

The module should verify that the repository exists before attempting to access its attributes. If repository_ctx.entity is None, the subsequent call to repository.get("remote") on line 170 will raise an AttributeError.

Suggested change
- repository = repository_ctx.entity
+ repository = repository_ctx.entity
+ if not repository:
+     result["failed"] = True
+     result["msg"] = f"Repository '{sync_item['repository']}' not found."
+     return result
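
The failure mode described above is easy to reproduce in isolation. In this sketch, `lookup_repository` is a hypothetical stand-in for the repository lookup and returns `None` for an unknown name (an assumption matching the behaviour the reviewer describes), while `build_sync_result` shows the guarded early return. The stored data is made up.

```python
# Made-up lookup table standing in for what the Pulp API would return.
KNOWN_REPOSITORIES = {"alpine": {"name": "alpine", "remote": "example-remote-href"}}


def lookup_repository(name: str):
    """Hypothetical lookup: returns the entity dict, or None if not found."""
    return KNOWN_REPOSITORIES.get(name)


def build_sync_result(repository_name: str) -> dict:
    result = {"repository": repository_name, "failed": False}
    repository = lookup_repository(repository_name)
    if not repository:
        # Fail this item cleanly instead of crashing on None below.
        result["failed"] = True
        result["msg"] = f"Repository '{repository_name}' not found."
        return result
    result["remote"] = repository.get("remote")
    return result


# Without the guard, calling .get() on None raises AttributeError.
try:
    lookup_repository("missing").get("remote")
    unguarded_error = None
except AttributeError as exc:
    unguarded_error = type(exc).__name__

missing = build_sync_result("missing")
found = build_sync_result("alpine")
```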

Comment on lines +175 to +178
remote_ctx = PulpContainerRemoteContext(
    pulp_ctx, entity={"name": remote_name}
)
payload["remote"] = remote_ctx

high

The module should verify that the remote exists before using it in the sync payload. Additionally, it is safer to pass the remote's pulp_href (a string) rather than the context object itself to the sync method to ensure compatibility with the underlying API client.

Suggested change
- remote_ctx = PulpContainerRemoteContext(
-     pulp_ctx, entity={"name": remote_name}
- )
- payload["remote"] = remote_ctx
+ remote_ctx = PulpContainerRemoteContext(
+     pulp_ctx, entity={"name": remote_name}
+ )
+ if not remote_ctx.entity:
+     result["failed"] = True
+     result["msg"] = f"Remote '{remote_name}' not found."
+     return result
+ payload["remote"] = remote_ctx.pulp_href

Comment on lines +25 to +42
container_remotes_list: "{{ container_remotes_list + [{
'name': item.name + '-remote',
'upstream_name': item.get('upstream_name', item.name),
'url': item.get('url'),
'ca_cert': item.get('ca_cert'),
'client_cert': item.get('client_cert'),
'client_key': item.get('client_key'),
'download_concurrency': item.get('download_concurrency'),
'exclude_tags': item.get('exclude_tags'),
'include_tags': item.get('include_tags'),
'policy': item.get('policy'),
'proxy_url': item.get('proxy_url'),
'proxy_username': item.get('proxy_username'),
'proxy_password': item.get('proxy_password'),
'remote_username': item.get('remote_username'),
'remote_password': item.get('remote_password'),
'tls_validation': item.get('tls_validation'),
'state': item.get('state') } | dict2items | rejectattr('value', 'none') | items2dict ] }}"

medium

Using set_fact within a loop to build a list is an anti-pattern in Ansible because it re-evaluates the entire list on every iteration, leading to $O(N^2)$ performance. Given that this PR aims to improve performance through parallelism, consider refactoring this to use a single set_fact with a complex Jinja2 expression or a custom filter to build the list in one go, which would be significantly more efficient for a large number of repositories.
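
One way to read the "custom filter" suggestion is an Ansible filter plugin that builds the whole list in a single pass. The sketch below is an assumption about how that could look, not code from this PR: the filter name `container_remotes_from_repos` is made up, and the key list mirrors the task quoted in this comment. Ansible discovers filters through a `FilterModule` class with a `filters()` method.

```python
# Optional keys copied from each repository item, as in the quoted task.
OPTIONAL_KEYS = (
    "url", "ca_cert", "client_cert", "client_key", "download_concurrency",
    "exclude_tags", "include_tags", "policy", "proxy_url", "proxy_username",
    "proxy_password", "remote_username", "remote_password", "tls_validation",
    "state",
)


def container_remotes_from_repos(repos):
    """Build the remotes list in one pass, dropping keys whose value is None."""
    remotes = []
    for repo in repos:
        remote = {
            "name": repo["name"] + "-remote",
            "upstream_name": repo.get("upstream_name", repo["name"]),
        }
        remote.update({key: repo.get(key) for key in OPTIONAL_KEYS})
        # Same effect as dict2items | rejectattr('value', 'none') | items2dict.
        remotes.append({k: v for k, v in remote.items() if v is not None})
    return remotes


class FilterModule(object):
    """Hypothetical filter plugin exposing the helper to Jinja2 templates."""

    def filters(self):
        return {"container_remotes_from_repos": container_remotes_from_repos}


example = container_remotes_from_repos(
    [{"name": "alpine", "url": "https://registry.example.com", "policy": None}]
)
```

With such a plugin in place, a single `set_fact` could apply the filter to the input list once instead of appending inside a loop; the variable holding the input list is whatever the role already uses.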

bbezak added 2 commits May 12, 2026 16:55
Resolve content guard names to hrefs in the batch container
distribution module, matching the old pulp.squeezer behavior. Add a
test covering guarded container distributions.

Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>

Hide the set_fact task that builds remotes with passwords and keys.

Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>
6 participants