Downstream implementation of parallel pulp container modules in python #73
Alex-Welsh wants to merge 4 commits into
Summary of Changes
Hello @Alex-Welsh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the performance and scalability of managing Pulp container content through Ansible. By introducing new parallelized modules and refactoring existing roles, it enables concurrent operations for repositories, remotes, distributions, and syncs, leading to substantial time savings in deployment and management workflows. The changes aim to provide a more robust and efficient experience for users interacting with Pulp container services.
Code Review
This pull request introduces a significant and impressive refactoring to parallelize Pulp container operations by introducing new concurrent Ansible modules. The performance gains described are substantial. The new modules for repositories, remotes, distributions, and syncs are well-structured, leveraging concurrent.futures for parallelism and pulp-glue for API interactions. The corresponding roles and tests have been effectively updated to use these new batch modules. My review includes a couple of suggestions for the new task-based implementation of container content management to improve correctness and simplify the code.
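The batch pattern the review describes can be sketched roughly as follows. This is a minimal illustration of fanning items out over `concurrent.futures`, not the code from this PR: `run_batch` and `ensure_repository` are hypothetical names, and the worker is a stand-in for whatever pulp-glue call a real module would make.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(items, worker, max_workers=10):
    """Apply worker to every item concurrently and keep results in input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the item it was submitted for.
        futures = {pool.submit(worker, item): idx for idx, item in enumerate(items)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                results[idx] = future.result()
            except Exception as exc:
                # One failing item is recorded instead of aborting the batch.
                results[idx] = {"failed": True, "msg": str(exc)}
    return results


def ensure_repository(item):
    """Hypothetical worker; a real module would call the Pulp API here."""
    if not item.get("name"):
        raise ValueError("repository name is required")
    return {"failed": False, "changed": True, "name": item["name"]}


if __name__ == "__main__":
    batch = [{"name": "alpine"}, {}, {"name": "busybox"}]
    for outcome in run_batch(batch, ensure_repository, max_workers=2):
        print(outcome)
```

Recording failures per item, rather than raising on the first one, lets the caller report every problem in the batch at once.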
7d1881f to a0f69f2
a0f69f2 to 5440658
5440658 to d2a0b81
/gemini review
Code Review
This pull request introduces a significant performance improvement by implementing parallel processing for Pulp container operations. The new Ansible modules for repositories, remotes, distributions, and syncs are well-designed to handle batch operations concurrently, drastically reducing execution time as noted in the description. The code is generally clean and includes new tests for the added functionality. My review includes a few suggestions to improve maintainability, consistency, and security, such as refactoring duplicated code, improving parameter validation, and ensuring all sensitive data is sanitized. Overall, this is a great contribution that will significantly benefit users managing large numbers of container resources in Pulp.
9f8f40c to 93589d8
I am not keen on having one commit with a GPL license and another with Apache 2. Could you please squash?
93589d8 to e293c07
My own comparison of the old and new implementation on repo sync/publish (on an already synced Pulp):
e293c07 to 34b7e60
@Alex-Welsh Is this ready to be used?
Yeah, I've used it a lot over the last few months and it all seems to work now.
Mark remote/proxy usernames as no_log and strip them from returned container remote data, matching the old pulp.squeezer remote module behavior. Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>
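As a rough illustration of what this commit describes, the sketch below marks credential parameters with `no_log` in an Ansible-style argument spec and strips credential fields from the data a module would return. The exact key names in `SENSITIVE_KEYS` and the helper `sanitize_remote` are assumptions made for the example, not the names used in the PR.

```python
# Ansible argument specs accept no_log=True to keep a value out of logs.
# The parameter names are taken from the task list in this PR.
CREDENTIAL_ARGS = {
    "remote_username": {"type": "str", "no_log": True},
    "remote_password": {"type": "str", "no_log": True},
    "proxy_username": {"type": "str", "no_log": True},
    "proxy_password": {"type": "str", "no_log": True},
}

# Assumed names of the credential fields in the returned remote data.
SENSITIVE_KEYS = frozenset(
    {"username", "password", "proxy_username", "proxy_password", "client_key"}
)


def sanitize_remote(remote):
    """Return a copy of the remote data without credential fields."""
    return {key: value for key, value in remote.items() if key not in SENSITIVE_KEYS}


if __name__ == "__main__":
    raw = {"name": "alpine-remote", "url": "https://registry.example", "username": "ci"}
    print(sanitize_remote(raw))
```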
/gemini review
Code Review
This pull request introduces several new Ansible modules—container_distributions, container_remotes, container_repositories, and container_syncs—designed to perform batch operations concurrently on a Pulp server. The existing container tasks in the pulp_distribution and pulp_repository roles have been refactored to use these batch modules, and a new prereqs task ensures the required pulp-glue dependencies are installed. Feedback on the PR identifies critical missing existence checks for content guards, repositories, and remotes within the new modules to prevent potential runtime errors. Additionally, there is a recommendation to refactor the list-building logic in the Ansible tasks to avoid the performance overhead associated with looping set_fact.
    content_guard_ctx = PulpContentGuardContext(
        pulp_ctx, entity={"name": content_guard_name}
    )
    desired_attributes["content_guard"] = content_guard_ctx.pulp_href
The module should verify that the content guard exists if a name is provided. Accessing pulp_href on a context with a missing entity may lead to unexpected behavior or errors if the guard cannot be found in Pulp.
Suggested change:
      content_guard_ctx = PulpContentGuardContext(
          pulp_ctx, entity={"name": content_guard_name}
      )
    + if not content_guard_ctx.entity:
    +     result["failed"] = True
    +     result["msg"] = f"Content guard '{content_guard_name}' not found."
    +     return result
      desired_attributes["content_guard"] = content_guard_ctx.pulp_href
    repository_ctx = PulpContainerRepositoryContext(
        pulp_ctx, entity={"name": sync_item["repository"]}
    )
    repository = repository_ctx.entity
The module should verify that the repository exists before attempting to access its attributes. If repository_ctx.entity is None, the subsequent call to repository.get("remote") on line 170 will raise an AttributeError.
Suggested change:
      repository = repository_ctx.entity
    + if not repository:
    +     result["failed"] = True
    +     result["msg"] = f"Repository '{sync_item['repository']}' not found."
    +     return result
    remote_ctx = PulpContainerRemoteContext(
        pulp_ctx, entity={"name": remote_name}
    )
    payload["remote"] = remote_ctx
The module should verify that the remote exists before using it in the sync payload. Additionally, it is safer to pass the remote's pulp_href (a string) rather than the context object itself to the sync method to ensure compatibility with the underlying API client.
Suggested change:
      remote_ctx = PulpContainerRemoteContext(
          pulp_ctx, entity={"name": remote_name}
      )
    - payload["remote"] = remote_ctx
    + if not remote_ctx.entity:
    +     result["failed"] = True
    +     result["msg"] = f"Remote '{remote_name}' not found."
    +     return result
    + payload["remote"] = remote_ctx.pulp_href
    container_remotes_list: "{{ container_remotes_list + [{
      'name': item.name + '-remote',
      'upstream_name': item.get('upstream_name', item.name),
      'url': item.get('url'),
      'ca_cert': item.get('ca_cert'),
      'client_cert': item.get('client_cert'),
      'client_key': item.get('client_key'),
      'download_concurrency': item.get('download_concurrency'),
      'exclude_tags': item.get('exclude_tags'),
      'include_tags': item.get('include_tags'),
      'policy': item.get('policy'),
      'proxy_url': item.get('proxy_url'),
      'proxy_username': item.get('proxy_username'),
      'proxy_password': item.get('proxy_password'),
      'remote_username': item.get('remote_username'),
      'remote_password': item.get('remote_password'),
      'tls_validation': item.get('tls_validation'),
      'state': item.get('state') } | dict2items | rejectattr('value', 'none') | items2dict ] }}"
Using set_fact within a loop to build a list is an anti-pattern in Ansible because it re-evaluates the entire list on every iteration, which adds performance overhead. Consider a single set_fact with a complex Jinja2 expression or a custom filter to build the list in one go, which would be significantly more efficient for a large number of repositories.
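One way to build the list in a single pass is a small custom filter plugin. The sketch below mirrors the keys from the task above and drops `None` values the way the `rejectattr('value', 'none')` step does. The filter name `container_remotes` and the function `build_container_remotes` are hypothetical, chosen for the example; `FilterModule` with a `filters()` method is the usual Ansible filter plugin layout.

```python
# Optional keys copied from each repository entry when they are set.
REMOTE_KEYS = (
    "url", "ca_cert", "client_cert", "client_key", "download_concurrency",
    "exclude_tags", "include_tags", "policy", "proxy_url", "proxy_username",
    "proxy_password", "remote_username", "remote_password", "tls_validation",
    "state",
)


def build_container_remotes(repositories):
    """Build the full remotes list in one pass, omitting unset (None) values."""
    remotes = []
    for repo in repositories:
        remote = {
            "name": repo["name"] + "-remote",
            "upstream_name": repo.get("upstream_name", repo["name"]),
        }
        for key in REMOTE_KEYS:
            value = repo.get(key)
            if value is not None:
                remote[key] = value
        remotes.append(remote)
    return remotes


class FilterModule(object):
    """Ansible filter plugin entry point."""

    def filters(self):
        return {"container_remotes": build_container_remotes}


if __name__ == "__main__":
    print(build_container_remotes([{"name": "alpine", "url": "https://registry.example"}]))
```

With the plugin in a `filter_plugins` directory, a single task could then set `container_remotes_list: "{{ repos | container_remotes }}"`, where `repos` stands for whatever variable holds the repository list.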
Resolve content guard names to hrefs in the batch container distribution module, matching the old pulp.squeezer behavior. Add a test covering guarded container distributions. Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>
Hide the set_fact task that builds remotes with passwords and keys. Signed-off-by: Bartosz Bezak <bartosz@stackhpc.com>
Full disclosure: Gen AI was used to create the new modules.
I've spent a while testing the changes against an aio-like environment, and included new test playbooks.
The code is complex but fairly readable. It's a downstream tool and not at all critical, so if we hit any issues we can easily revert to the old version.
Results for a full container sync/publish:
Old role: 2h 1m 31s
New modules: 10m 13s
The new implementation ran with a concurrency of 10, which could be pushed higher. The test instance was on SMS, using an aufn-like environment to create a seed that hosts Pulp, so there is a fair bit of virtualisation overhead.
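For reference, the two timings above can be turned into a speedup factor with a few lines of Python. This only restates the figures already reported and adds no new measurements.

```python
from datetime import timedelta

# Timings reported above for a full container sync/publish.
old_role = timedelta(hours=2, minutes=1, seconds=31)
new_modules = timedelta(minutes=10, seconds=13)

# Dividing one timedelta by another yields a plain float ratio.
speedup = old_role / new_modules
saved = old_role - new_modules

print(f"{speedup:.1f}x faster, {saved} saved")  # 11.9x faster, 1:51:18 saved
```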
Depends on #71