Conversation
* Remove 'Buy at Amazon' links * Set dirty flags and use bulk delete
| resources_to_delete_ids = [] | ||
| for content in WebsiteContent.objects.filter( | ||
| title__icontains=AMAZON_BUTTON_IMAGE, | ||
| type=CONTENT_TYPE_EXTERNAL_RESOURCE, | ||
| ): | ||
| if AMAZON_BUTTON_TITLE_PATTERN.match(content.title): | ||
| resources_to_delete_ids.append(content.pk) | ||
| website_ids.add(content.website_id) |
There was a problem hiding this comment.
Bug: The migration may delete ExternalResource records while leaving behind broken resource_link shortcodes that reference them, as it doesn't account for all possible shortcode text variations.
Severity: MEDIUM
Suggested Fix
Before deleting an ExternalResource, query its referencing_content relationship to find all markdown content that links to it. Ensure all references are removed, not just those matching a specific shortcode pattern. This validates referential integrity and prevents creating broken links.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: websites/migrations/0071_remove_buy_at_amazon_links.py#L38-L45
Potential issue: The migration deletes `ExternalResource` records based on a title match
for an Amazon image markdown. However, it does not verify that all `resource_link`
shortcodes pointing to that resource have been removed from markdown content. If a
resource is referenced using a shortcode with plain text instead of the image markdown
(e.g., `{{% resource_link "uuid" "Buy at Amazon" %}}`), the resource will be deleted,
but the shortcode will remain, resulting in a broken link. The migration fails to use
the `referencing_content` relationship to ensure referential integrity before deletion.
Did we get this right? 👍 / 👎 to inform future reviews.
| for content in WebsiteContent.objects.filter(markdown__icontains=AMAZON_BUTTON_IMAGE): | ||
| updated = AMAZON_BUTTON_SHORTCODE_PATTERN.sub("", content.markdown) | ||
| updated = AMAZON_BUTTON_IMAGE_PATTERN.sub("", updated) | ||
| if updated != content.markdown: | ||
| content.markdown = updated |
There was a problem hiding this comment.
Bug: The regex for removing Amazon button shortcodes fails to match variants with fragment arguments, leading to partial removal and creating broken shortcodes.
Severity: LOW
Suggested Fix
Update the AMAZON_BUTTON_SHORTCODE_PATTERN regex to optionally match a third positional argument for a fragment (e.g., "#[^\"]+") before the closing %\}\}. This will ensure the entire shortcode is removed correctly, even when a fragment is present.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: websites/migrations/0071_remove_buy_at_amazon_links.py#L30-L34
Potential issue: The migration uses two regex patterns sequentially. The first,
`AMAZON_BUTTON_SHORTCODE_PATTERN`, is intended to remove the entire Amazon button
shortcode. However, this pattern fails to match `resource_link` shortcodes that contain
an optional third fragment argument (e.g., `"#some-fragment"`). In such cases, the
second pattern, `AMAZON_BUTTON_IMAGE_PATTERN`, proceeds to remove only the image
markdown part of the shortcode's text argument. This leaves behind a syntactically
invalid shortcode with an empty text argument, such as `{{% resource_link "uuid" ""
"#fragment" %}}`, which could corrupt content.
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
Code Review
This pull request updates the version to 0.184.2 and introduces a data migration to remove 'Buy at Amazon' links from website content. The migration identifies and removes specific shortcodes and images from markdown fields and deletes the corresponding external resource records. The review feedback provides several optimizations for handling large datasets during the migration, including the use of .iterator(), .values_list(), and batch_size for bulk updates. Additionally, it is recommended to use .hard_delete() to align with the migration's stated intent of permanently removing the resources.
|
|
||
| markdown_to_update = [] | ||
| website_ids = set() | ||
| for content in WebsiteContent.objects.filter(markdown__icontains=AMAZON_BUTTON_IMAGE): |
There was a problem hiding this comment.
Using .iterator() here is recommended to avoid loading all matching WebsiteContent instances into memory at once. This is particularly important for global migrations on large tables like WebsiteContent to prevent potential memory exhaustion.
| for content in WebsiteContent.objects.filter(markdown__icontains=AMAZON_BUTTON_IMAGE): | |
| for content in WebsiteContent.objects.filter(markdown__icontains=AMAZON_BUTTON_IMAGE).iterator(): |
References
- Using .iterator() on a Django queryset is unnecessary for memory optimization if the number of records being processed is known to be small.
| for content in WebsiteContent.objects.filter( | ||
| title__icontains=AMAZON_BUTTON_IMAGE, | ||
| type=CONTENT_TYPE_EXTERNAL_RESOURCE, | ||
| ): | ||
| if AMAZON_BUTTON_TITLE_PATTERN.match(content.title): | ||
| resources_to_delete_ids.append(content.pk) | ||
| website_ids.add(content.website_id) |
There was a problem hiding this comment.
This loop can be optimized by using .values_list('pk', 'title', 'website_id'). This avoids instantiating full model objects for every record that matches the icontains filter, significantly reducing memory overhead and improving performance.
| for content in WebsiteContent.objects.filter( | |
| title__icontains=AMAZON_BUTTON_IMAGE, | |
| type=CONTENT_TYPE_EXTERNAL_RESOURCE, | |
| ): | |
| if AMAZON_BUTTON_TITLE_PATTERN.match(content.title): | |
| resources_to_delete_ids.append(content.pk) | |
| website_ids.add(content.website_id) | |
| resources_to_delete = WebsiteContent.objects.filter( | |
| title__icontains=AMAZON_BUTTON_IMAGE, | |
| type=CONTENT_TYPE_EXTERNAL_RESOURCE, | |
| ).values_list("pk", "title", "website_id") | |
| for pk, title, website_id in resources_to_delete: | |
| if AMAZON_BUTTON_TITLE_PATTERN.match(title): | |
| resources_to_delete_ids.append(pk) | |
| website_ids.add(website_id) |
|
|
||
| with transaction.atomic(): | ||
| if markdown_to_update: | ||
| WebsiteContent.objects.bulk_update(markdown_to_update, ["markdown"]) |
There was a problem hiding this comment.
If markdown_to_update contains a large number of records, bulk_update might hit database limits or consume excessive memory. Specifying a batch_size (e.g., 100 or 1000) is a safer practice for migrations involving large text fields.
| WebsiteContent.objects.bulk_update(markdown_to_update, ["markdown"]) | |
| WebsiteContent.objects.bulk_update(markdown_to_update, ["markdown"], batch_size=100) |
| WebsiteContent.objects.bulk_update(markdown_to_update, ["markdown"]) | ||
|
|
||
| if resources_to_delete_ids: | ||
| WebsiteContent.objects.filter(pk__in=resources_to_delete_ids).delete() |
There was a problem hiding this comment.
The docstring mentions a 'hard-delete', but since WebsiteContent is a SafeDeleteModel, the standard .delete() method on a queryset will perform a soft-delete. If the intention is to permanently remove these records (and their references), consider using .hard_delete() instead.
| WebsiteContent.objects.filter(pk__in=resources_to_delete_ids).delete() | |
| WebsiteContent.objects.filter(pk__in=resources_to_delete_ids).hard_delete() |
pt2302