Conversation
* Update reset_sync_states management command
* Improve efficiency and messages to user
* feat: Add S3 content sync pipeline mitodl/hq#8703
* feat: added tracking with referencing_content for course home image, video gallery items, resource list items, and other relation-backed content
* feat: added referenced_by for stories and featured items on home page
* chore: addressed feedback from Copilot and Gemini
* refactor: parameterized tests and removed redundant tests
* refactor: parameterized tests
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Summary of Changes

Hello @odlbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request, marking release 0.178.3, delivers substantial improvements to content management and synchronization. It introduces a robust S3 bucket synchronization pipeline for automated content updates and refines existing content sync mechanisms to offer more targeted control. A major focus is also on enhancing content relationship tracking by recognizing new content types and their associated metadata fields, ensuring a more interconnected content ecosystem. Additionally, it includes a crucial security update for a core dependency and a minor infrastructure change for a polyfill resource.

Highlights
Changelog
Activity
```python
filtered_websites = []
if self.filter_list:
    filtered_websites = [
        fetch_website(site_identifier) for site_identifier in self.filter_list
    ]
```
Bug: The list comprehension calling fetch_website does not handle the Website.DoesNotExist exception, causing the command to crash if an invalid website identifier is provided via a filter.
Severity: HIGH
Suggested Fix
Wrap the call to fetch_website(site_identifier) inside a try...except Website.DoesNotExist block. Log a warning or skip the invalid identifier instead of letting the command crash. Alternatively, consider using a queryset-based filter approach which gracefully handles non-existent identifiers.
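A minimal sketch of the suggested fix, assuming the command's existing imports of `fetch_website` and the `Website` model; the warning wording is illustrative:

```python
filtered_websites = []
if self.filter_list:
    for site_identifier in self.filter_list:
        try:
            filtered_websites.append(fetch_website(site_identifier))
        except Website.DoesNotExist:
            # Warn and skip the invalid identifier instead of letting
            # the whole management command crash.
            self.stderr.write(
                f"Skipping unknown website identifier: {site_identifier}"
            )
```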
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: content_sync/management/commands/reset_sync_states.py#L68-L72

Potential issue: The management command `reset_sync_states` processes a list of website identifiers from user-provided command-line arguments (`--filter` or `--filter-json`). The code uses a list comprehension to call `fetch_website` for each identifier. However, `fetch_website` will raise a `Website.DoesNotExist` exception if an identifier is not found, and this exception is not handled. This will cause the entire management command to crash. The same unhandled exception pattern also exists in the `sync_website_to_backend.py` and `sync_backend_to_db.py` commands when they process filters.
Code Review
The pull request introduces an S3 bucket synchronization pipeline and updates the reset_sync_states management command to handle filtered websites directly. It also includes changes to track referencing content for new content types and updates the pypdf dependency for security. The changes enhance the content synchronization capabilities and improve the accuracy of content referencing within the system. Several new tests have been added following the preferred pytest style, and a comment regarding Celery task management has been updated to align with best practices for robust and scalable asynchronous task handling.
```diff
 [[package]]
 name = "pypdf"
-version = "6.6.2"
+version = "6.7.1"
```
```python
@is_publish_pipeline_enabled
def get_s3_bucket_sync_pipeline(api: object | None = None) -> BasePipeline:
    """Get the S3 bucket sync pipeline"""
    return import_string(
        f"content_sync.pipelines.{settings.CONTENT_SYNC_PIPELINE_BACKEND}.S3BucketSyncPipeline"
    )(api=api)
```
The get_s3_bucket_sync_pipeline function is decorated with @is_publish_pipeline_enabled. While this is generally good for controlling pipeline access, ensure that the S3 bucket sync pipeline is indeed considered a 'publish' pipeline and that its functionality aligns with the intent of this decorator. If it's a background sync task that doesn't directly involve publishing user-facing content, a more specific decorator or a different mechanism might be clearer.
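As one illustration of that suggestion, a sync-specific gate could look like the sketch below; the decorator name and the setting it checks are hypothetical and not part of this codebase:

```python
from functools import wraps

from django.conf import settings


def is_sync_pipeline_enabled(func):
    """Hypothetical alternative gate keyed to a sync-specific flag
    rather than the publish-pipeline flag."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        # CONTENT_SYNC_PIPELINE_ENABLED is an assumed setting name.
        if not getattr(settings, "CONTENT_SYNC_PIPELINE_ENABLED", False):
            return None
        return func(*args, **kwargs)

    return wrapper
```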
```diff
+filtered_websites = []
+if self.filter_list:
+    filtered_websites = [
+        fetch_website(site_identifier) for site_identifier in self.filter_list
+    ]
 start = now_in_utc()
-task = sync_unsynced_websites.delay(create_backends=create_backends)
-self.stdout.write(f"Starting task {task}...")
-task.get()
+if filtered_websites:
+    self.stdout.write(
+        f"Syncing {len(filtered_websites)} filtered "
+        "website(s) to the designated backend"
+    )
+    for website in filtered_websites:
+        backend = get_sync_backend(website)
+        if create_backends or backend.backend_exists():
+            self.stdout.write(
+                f"Syncing website '{website.title}' to backend..."
+            )
+            backend.create_website_in_backend()
+            backend.sync_all_content_to_backend()
+            reset_publishing_fields(website.name)
+        else:
+            self.stderr.write(
+                f"Skipping website '{website.title}': "
+                "backend does not exist "
+                "(use --create_backends to create it)"
+            )
+else:
+    self.stdout.write(
+        "Syncing all unsynced websites to the designated backend"
+    )
+    task = sync_unsynced_websites.delay(create_backends=create_backends)
+    self.stdout.write(f"Starting task {task}...")
+    task.get()
```
The logic for syncing filtered websites directly within the handle method duplicates functionality already implemented in the sync_unsynced_websites task. While it provides immediate feedback for filtered sites, it bypasses the Celery task queue for these specific cases, which could lead to inconsistencies in error handling, retries, and task monitoring. It might be better to refactor sync_unsynced_websites to accept a list of website names or IDs, allowing all sync operations to go through Celery for consistency and scalability, as sketched below.
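A rough sketch of that refactor, under stated assumptions: the `website_names` parameter is hypothetical, and the Celery app and helper import paths are guesses at this codebase's layout:

```python
from main.celery import app  # assumed Celery app module

from content_sync.api import get_sync_backend  # assumed import paths
from websites.api import reset_publishing_fields
from websites.models import Website


@app.task
def sync_unsynced_websites(create_backends=False, website_names=None):
    """Sync websites to the backend, optionally restricted by name.

    website_names is a hypothetical parameter: when provided, only the
    matching websites are synced, so filtered runs share the queue's
    error handling, retries, and monitoring with full runs.
    """
    websites = Website.objects.all()
    if website_names:
        websites = websites.filter(name__in=website_names)
    for website in websites:
        backend = get_sync_backend(website)
        if create_backends or backend.backend_exists():
            backend.create_website_in_backend()
            backend.sync_all_content_to_backend()
            reset_publishing_fields(website.name)
```

The management command could then call `sync_unsynced_websites.delay(create_backends=create_backends, website_names=self.filter_list)` for both the filtered and unfiltered paths.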
References

- Use Celery's `chord` and `chain` to manage dependencies between asynchronous tasks; they are a more robust and scalable pattern than executing tasks synchronously.
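For context, a generic illustration of those two primitives; the task names are invented for the example and are not part of this codebase:

```python
from celery import chain, chord

# chain: run tasks sequentially, feeding each result into the next task.
chain(export_site.s("site-1"), upload_to_s3.s(), mark_synced.s()).apply_async()

# chord: run a group of tasks in parallel, then invoke a callback once
# every task in the group has finished.
chord(
    (sync_site.s(name) for name in ["site-1", "site-2", "site-3"]),
    notify_sync_complete.s(),
).apply_async()
```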