
Release 0.178.3 #2908

Merged
odlbot merged 7 commits into release from release-candidate on Feb 24, 2026

Conversation

@odlbot (Contributor) commented Feb 20, 2026

pt2302 and others added 7 commits February 17, 2026 06:45
* Update reset_sync_states management command
  * Improve efficiency and messages to user
* feat: Add S3 content sync pipeline (mitodl/hq#8703)
* feat: added tracking with referencing_content for course home image, video gallery items, resource list items, and other relation-backed content
* feat: added referenced_by for stories, and featured items on home page
* chore: addressed feedback from copilot and gemini
* refactor: parameterized tests and removed redundant tests
* refactor: parameterized tests

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @odlbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request, marking release 0.178.3, delivers improvements to content management and synchronization. It introduces an S3 bucket synchronization pipeline for automated content updates and extends the existing content sync command with more targeted control. It also expands content relationship tracking to recognize additional content types and their associated metadata fields. Finally, it includes a security update for a core dependency (pypdf) and a minor change to a polyfill resource URL.

Highlights

  • S3 Content Synchronization Pipeline: A new S3 content synchronization pipeline has been introduced, allowing for periodic syncing of S3 buckets. This includes a new management command and Concourse pipeline definition to manage its lifecycle.
  • Enhanced Content Referencing: The system for tracking referencing content has been significantly expanded to include new content types such as course collections, promos, testimonials, stories, and homepage settings. This improves the accuracy and completeness of content relationship tracking.
  • Dependency Update: The pypdf dependency has been updated to version 6.7.1, addressing a security vulnerability.
  • Improved reset_sync_states Command: The reset_sync_states management command now supports filtering websites for synchronization, providing more granular control over the content syncing process.
  • Polyfill URL Update: The polyfill URL in base.html has been updated to use a different CDN source.
Changelog
  • RELEASE.rst
    • Added release notes for version 0.178.3, summarizing the included changes.
  • content_sync/api.py
    • Added a new function get_s3_bucket_sync_pipeline to retrieve the S3 bucket synchronization pipeline.
  • content_sync/management/commands/reset_sync_states.py
    • Imported get_sync_backend, fetch_website, and reset_publishing_fields.
    • Modified the handle method to support filtering websites for synchronization and to use the new S3 sync logic.
  • content_sync/management/commands/upsert_s3_bucket_sync_pipeline.py
    • Added a new management command to create or update the S3 bucket sync Concourse pipeline, including options to unpause or delete existing pipelines.
  • content_sync/pipelines/base.py
    • Introduced BaseS3BucketSyncPipeline as a base class for S3 bucket syncing pipelines.
  • content_sync/pipelines/concourse.py
    • Imported BaseS3BucketSyncPipeline and S3BucketSyncPipelineDefinition.
    • Added S3BucketSyncPipeline class, extending BaseS3BucketSyncPipeline and GeneralPipeline, to define the Concourse pipeline for S3 bucket synchronization.
  • content_sync/pipelines/definitions/concourse/s3_bucket_sync_pipeline.py
    • Added a new file defining the S3BucketSyncPipelineDefinition class, which constructs a Concourse pipeline for periodic S3 bucket syncing using AWS CLI.
  • content_sync/tasks.py
    • Added a new Celery task upsert_s3_bucket_sync_pipeline to manage the creation and updating of the S3 bucket sync pipeline.
  • main/settings.py
    • Updated the project version to 0.178.3.
    • Added new settings: AWS_IMPORT_STORAGE_BUCKET_NAME and AWS_S3_SYNC_INTERVAL for S3 synchronization configuration.
  • main/templates/base.html
    • Updated the polyfill.io URL to use cdnjs.cloudflare.com.
  • poetry.lock
    • Updated the pypdf dependency from 6.6.2 to 6.7.1.
    • Removed black from the dev extra dependencies for pypdf.
  • websites/constants.py
    • Added new content types: CONTENT_TYPE_COURSE_COLLECTION, CONTENT_TYPE_PROMO, CONTENT_TYPE_TESTIMONIAL, CONTENT_TYPE_STORY, CONTENT_TYPE_HOMEPAGE_SETTINGS.
    • Added new metadata fields for various content types, such as METADATA_FIELD_COURSE_IMAGE, METADATA_FIELD_RESOURCE_LIST_RESOURCES, METADATA_FIELD_COVER_IMAGE, METADATA_FIELD_FEATURED_PROMOS, etc.
  • websites/management/commands/backpopulate_referencing_content_test.py
    • Added docstrings to several test fixture functions for clarity.
  • websites/utils.py
    • Introduced UUID_REGEX_STR and pre-compiled regexes for parse_resource_uuid for performance (a sketch of this pattern follows the changelog).
    • Refined parse_resource_uuid to handle more complex resource shortcode patterns, including multiple UUIDs within attributes and positional UUIDs.
    • Added _extract_relation_text_ids helper function to extract UUIDs from relation widget data.
    • Expanded get_metadata_content_key to include new content types and their associated metadata fields for referencing content.
    • Modified compile_referencing_content to use the new UUID parsing logic and handle direct UUID strings in metadata fields.
  • websites/utils_test.py
    • Added extensive new test cases for parse_resource_uuid to cover various shortcode formats, including multiple UUIDs and positional UUIDs.
    • Added new test cases for compile_referencing_content to verify correct extraction of referencing content from the newly defined content types and their metadata fields, including cross-site relations and edge cases.
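
For orientation, here is a minimal sketch of the pre-compiled-regex approach described for websites/utils.py above; the actual UUID_REGEX_STR and shortcode patterns in the PR are more involved, so the ones below are illustrative assumptions only.

import re

# Illustrative patterns; the real UUID_REGEX_STR lives in websites/utils.py.
UUID_REGEX_STR = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
UUID_REGEX = re.compile(UUID_REGEX_STR)
# Simplified match for Hugo-style shortcodes such as {{% resource uuid="..." %}}.
SHORTCODE_REGEX = re.compile(r"{{[%<]\s*resource\b(.*?)[%>]}}")


def parse_resource_uuid(markdown: str) -> list[str]:
    """Return every UUID found inside resource shortcodes in the given text."""
    uuids: list[str] = []
    for shortcode_body in SHORTCODE_REGEX.findall(markdown):
        uuids.extend(UUID_REGEX.findall(shortcode_body))
    return uuids

Compiling the patterns once at module import avoids recompiling them on every call, which matters when compile_referencing_content scans every piece of content on a site.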
Activity
  • renovate[bot] updated the pypdf dependency to v6.7.1 for security.
  • Umar Hassan implemented a feature to track referencing_content for missing content types.
  • Sar replaced a polyfill URL.
  • Chris Patti added an S3 content sync pipeline.
  • pt2302 updated the reset_sync_states management command.

Comment on lines +68 to +72
filtered_websites = []
if self.filter_list:
    filtered_websites = [
        fetch_website(site_identifier) for site_identifier in self.filter_list
    ]

Bug: The list comprehension calling fetch_website does not handle the Website.DoesNotExist exception, causing the command to crash if an invalid website identifier is provided via a filter.
Severity: HIGH

Suggested Fix

Wrap the call to fetch_website(site_identifier) inside a try...except Website.DoesNotExist block. Log a warning or skip the invalid identifier instead of letting the command crash. Alternatively, consider using a queryset-based filter approach which gracefully handles non-existent identifiers.
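
A minimal sketch of that guard, written as it would appear inside the command's handle method (the websites.models import path for Website is an assumption):

from websites.models import Website  # import path assumed

filtered_websites = []
for site_identifier in self.filter_list:
    try:
        filtered_websites.append(fetch_website(site_identifier))
    except Website.DoesNotExist:
        # Warn and skip the invalid identifier instead of crashing the command.
        self.stderr.write(f"Website '{site_identifier}' not found; skipping")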

Location: content_sync/management/commands/reset_sync_states.py#L68-L72

The same unhandled exception pattern also exists in the sync_website_to_backend.py and sync_backend_to_db.py commands when they process filters.


@gemini-code-assist (bot) left a comment


Code Review

The pull request introduces an S3 bucket synchronization pipeline and updates the reset_sync_states management command to handle filtered websites directly. It also tracks referencing content for new content types and updates the pypdf dependency for security. Together these changes extend the content synchronization capabilities and improve the accuracy of content referencing. Several new tests follow the preferred pytest style, and a comment on Celery task management was updated to align with best practices for robust, scalable asynchronous task handling.

Comment thread: poetry.lock

[[package]]
name = "pypdf"
-version = "6.6.2"
+version = "6.7.1"

Severity: CRITICAL (security)

The pypdf dependency is updated to 6.7.1. This is a security update, which is a critical maintenance task.

Comment thread: content_sync/api.py
Comment on lines +126 to +131
@is_publish_pipeline_enabled
def get_s3_bucket_sync_pipeline(api: object | None = None) -> BasePipeline:
    """Get the S3 bucket sync pipeline"""
    return import_string(
        f"content_sync.pipelines.{settings.CONTENT_SYNC_PIPELINE_BACKEND}.S3BucketSyncPipeline"
    )(api=api)

Severity: MEDIUM

The get_s3_bucket_sync_pipeline function is decorated with @is_publish_pipeline_enabled. While this is generally good for controlling pipeline access, ensure that the S3 bucket sync pipeline is indeed considered a 'publish' pipeline and that its functionality aligns with the intent of this decorator. If it's a background sync task that doesn't directly involve publishing user-facing content, a more specific decorator or a different mechanism might be clearer.
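
For illustration only, a sync-specific decorator along the lines this comment suggests might look like the sketch below; the is_sync_pipeline_enabled name and the FEATURE_SYNC_PIPELINE setting are hypothetical, not part of this PR.

import functools

from django.conf import settings


def is_sync_pipeline_enabled(func):
    """Skip the wrapped call unless the (hypothetical) sync-pipeline flag is on."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not getattr(settings, "FEATURE_SYNC_PIPELINE", False):
            return None
        return func(*args, **kwargs)

    return wrapper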

Comment on lines +68 to +72
filtered_websites = []
if self.filter_list:
    filtered_websites = [
        fetch_website(site_identifier) for site_identifier in self.filter_list
    ]

Severity: MEDIUM

The filtered_websites list is created by fetching each website from self.filter_list. This approach could lead to N+1 queries if fetch_website performs database lookups for each site. Consider optimizing this by fetching all websites in a single query if self.filter_list is large.
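
A single-query alternative could look like the sketch below; matching identifiers on name or url_path is an assumption about what fetch_website accepts, not something confirmed by this PR.

from django.db.models import Q

from websites.models import Website  # import path assumed

# One database query for all identifiers instead of one per identifier.
filtered_websites = list(
    Website.objects.filter(
        Q(name__in=self.filter_list) | Q(url_path__in=self.filter_list)
    )
)

Comparing the results against self.filter_list would also surface invalid identifiers without raising.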

Comment on lines 101 to +128
  start = now_in_utc()
- task = sync_unsynced_websites.delay(create_backends=create_backends)
- self.stdout.write(f"Starting task {task}...")
- task.get()
+ if filtered_websites:
+     self.stdout.write(
+         f"Syncing {len(filtered_websites)} filtered "
+         "website(s) to the designated backend"
+     )
+     for website in filtered_websites:
+         backend = get_sync_backend(website)
+         if create_backends or backend.backend_exists():
+             self.stdout.write(
+                 f"Syncing website '{website.title}' to backend..."
+             )
+             backend.create_website_in_backend()
+             backend.sync_all_content_to_backend()
+             reset_publishing_fields(website.name)
+         else:
+             self.stderr.write(
+                 f"Skipping website '{website.title}': "
+                 "backend does not exist "
+                 "(use --create_backends to create it)"
+             )
+ else:
+     self.stdout.write(
+         "Syncing all unsynced websites to the designated backend"
+     )
+     task = sync_unsynced_websites.delay(create_backends=create_backends)
+     self.stdout.write(f"Starting task {task}...")
+     task.get()

Severity: MEDIUM

The logic for syncing filtered websites directly within the handle method duplicates some of the functionality of the sync_unsynced_websites task. While it provides immediate feedback for filtered sites, it bypasses the Celery task queue for these specific cases, which could lead to inconsistencies in error handling, retries, and task monitoring. It might be better to refactor sync_unsynced_websites to accept a list of website names or IDs to process, allowing all sync operations to go through Celery for consistency and scalability (see the sketch after the reference below).

References
  1. Use Celery's chord and chain to manage dependencies between asynchronous tasks, as it is a more robust and scalable pattern than executing them synchronously.
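
A rough sketch of that refactor, with the website_names parameter and the import paths as assumptions rather than code from this PR:

from celery import shared_task

from content_sync.api import get_sync_backend  # import paths assumed
from websites.api import reset_publishing_fields
from websites.models import Website


@shared_task
def sync_unsynced_websites(create_backends=False, website_names=None):
    """Sync websites to the backend, optionally restricted to website_names."""
    websites = Website.objects.all()
    if website_names is not None:
        websites = websites.filter(name__in=website_names)
    for website in websites:
        backend = get_sync_backend(website)
        if create_backends or backend.backend_exists():
            backend.create_website_in_backend()
            backend.sync_all_content_to_backend()
            reset_publishing_fields(website.name)

The management command could then call sync_unsynced_websites.delay(create_backends=create_backends, website_names=self.filter_list) in both the filtered and unfiltered cases, keeping error handling, retries, and monitoring inside Celery.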

