feat: migrate caption and transcript resource relation #3031
Code Review
This pull request transitions video captions and transcripts from simple file path strings to multi-select relation fields. Key changes include updating the site configuration schema, adding utility functions to process multi-language file paths, and implementing data migrations to back-populate resource fields from existing relationships. Review feedback suggests scoping resource queries by website and preserving selection order in serializers, optimizing migration performance by avoiding N+1 queries, and manually updating website republish flags since historical model save methods do not trigger standard side effects.
if isinstance(relation_id, list):
    # Multi-select format: content is an array of text_ids.
    resources = list(WebsiteContent.objects.filter(text_id__in=relation_id))
    set_dict_field(metadata, target_field, resource_file_paths(resources))
else:
    # Scalar format: content is a single text_id string.
    resource = (
        WebsiteContent.objects.filter(text_id=relation_id).first()
        if relation_id
        else None
    )
    set_dict_field(metadata, target_field, resource_file_path(resource))
The WebsiteContent query should be scoped to the specific website to ensure correctness, as text_id is only unique within a website. Additionally, the order of resources should be preserved to match the order in relation_id, which is important for multi-select fields.
Note that relation_value typically contains a website key (the website name or URL path) which can be used for filtering.
if isinstance(relation_id, list):
    # Multi-select format: content is an array of text_ids.
    website_identifier = relation_value.get("website")
    resources = WebsiteContent.objects.filter(
        Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
        text_id__in=relation_id
    )
    # Preserve the order of resources as specified in relation_id
    resource_map = {r.text_id: r for r in resources}
    ordered_resources = [resource_map[tid] for tid in relation_id if tid in resource_map]
    set_dict_field(metadata, target_field, resource_file_paths(ordered_resources))
else:
    # Scalar format: content is a single text_id string.
    website_identifier = relation_value.get("website")
    resource = (
        WebsiteContent.objects.filter(
            Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
            text_id=relation_id
        ).first()
        if relation_id
        else None
    )
    set_dict_field(metadata, target_field, resource_file_path(resource))

captions_ids = list(
    content.referenced_by.filter(filename__contains="_captions_").values_list(
        "text_id", flat=True
    )
)
transcript_ids = list(
    content.referenced_by.filter(
        filename__contains="_transcript_"
    ).values_list("text_id", flat=True)
)
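
The review summary mentions avoiding N+1 queries in the migration. One possible way to read that for the two filtered lookups above is to fetch the referencing rows once and split them in Python. The sketch below assumes the rows arrive as (text_id, filename) pairs, for example from a single values_list("text_id", "filename") call; the function name is hypothetical and is not part of the PR.

```python
def split_caption_transcript_ids(rows):
    """Partition (text_id, filename) pairs in a single pass.

    Mirrors the two `filename__contains` filters above, but works on rows
    that were already fetched, so no extra query is issued per category.
    """
    captions_ids, transcript_ids = [], []
    for text_id, filename in rows:
        filename = filename or ""  # tolerate a missing filename
        if "_captions_" in filename:
            captions_ids.append(text_id)
        if "_transcript_" in filename:
            transcript_ids.append(text_id)
    return captions_ids, transcript_ids
```

Whether this actually reduces query count depends on how the rows are loaded; prefetching the relation for all content in the loop would be the other half of the change, and is not shown here.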
    changed = True

if changed:
    content.save(update_fields=["metadata"])
According to the organization's general rules, custom model methods like save() are not available on historical models during migrations. Therefore, side effects like updating the website's republish flags (has_unpublished_live, has_unpublished_draft) must be performed manually. You should collect the IDs of updated websites and perform a bulk update at the end of the migration.
References
- When performing a data migration in Django, custom model methods like save() are not available on historical models. If side effects like triggering a website republish are needed, they must be performed manually, for example, by directly updating flags on related models.
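
A minimal sketch of the manual flag update this comment describes. It assumes the Website model is registered under a "websites" app label and that the migration collects the IDs of websites whose content it changed; the function name is hypothetical, while the flag names come from the comment above.

```python
def flag_websites_for_republish(apps, website_ids):
    """Set the republish flags directly with one bulk UPDATE.

    Historical models do not run the real model's custom save() logic, so
    the flags are written explicitly. Returns the number of rows updated.
    """
    if not website_ids:
        return 0
    # Assumption: the model is reachable as ("websites", "Website").
    Website = apps.get_model("websites", "Website")
    return Website.objects.filter(pk__in=set(website_ids)).update(
        has_unpublished_live=True,
        has_unpublished_draft=True,
    )
```

Inside the migration's forward function, this would be called once after the loop, with the set of website IDs gathered wherever changed is set to True.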
    changed = True

if changed:
    content.save(update_fields=["metadata"])
Similar to migration 0073, the republish flags on the Website model should be updated manually since the historical WebsiteContent.save() method won't trigger the side effects defined in the actual model class.
5267047 to
412b552
- Add migration 0074: convert scalar _resource content to list format
- Update sync_video_relation_urls to handle both scalar and list content
- Update full_metadata to handle array-of-{file,language} _file values
- Update resolve_video_file_referenced_content_ids for array _file format
- Add video_captions_resource and video_transcript_resource relation fields (multiple: true) to site configs
- Add filter_type: not_equals to site-config schema
- Add resource_file_paths helper to videos/utils.py (language detection placeholder)
- Add tests: migration_0074, models, utils, serializers, site_config_api
412b552 to
8927014
Pull request overview
This PR updates how video caption/transcript metadata is represented and migrated, moving caption/transcript resource relations to multi-select while preserving compatibility for legacy caption/transcript file path metadata formats.
Changes:
- Update caption/transcript URL-path handling to support both legacy scalar strings and the new multi-language array-of-objects format.
- Add tests covering multi-language caption/transcript path resolution and full_metadata URL rewriting.
- Add a data migration to convert video_*_resource.content from scalar to list format, and update site configs/schema to support new relation widget behavior.
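
The scalar-to-list conversion in the last item can be sketched as a small helper. This is an illustration only: it assumes the two relation fields sit at the top level of a content item's metadata dict and that an empty scalar should become an empty list. Neither detail is confirmed by the PR text, and the function name is hypothetical.

```python
RELATION_FIELDS = ("video_captions_resource", "video_transcript_resource")


def relation_content_to_list(metadata):
    """Convert {"content": "text_id"} to {"content": ["text_id"]} in place.

    Returns True if any field was changed, so the caller knows whether the
    row needs saving. Values that are already lists are left untouched.
    """
    changed = False
    for field in RELATION_FIELDS:
        relation = metadata.get(field)
        if not isinstance(relation, dict):
            continue
        content = relation.get("content")
        if isinstance(content, list):
            continue  # already in multi-select form
        # Assumption: an empty or missing scalar maps to an empty list.
        relation["content"] = [content] if content else []
        changed = True
    return changed
```

Because list values are skipped, running the helper a second time on the same metadata reports no change, which keeps the migration safe to re-run.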
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| websites/utils.py | Adds support for resolving referenced content IDs from both scalar and array caption/transcript file metadata formats. |
| websites/utils_test.py | Adds test coverage for the array-of-objects caption/transcript file format. |
| websites/site_config_api_test.py | Updates expected generated metadata to include the new captions/transcript resource fields. |
| websites/serializers_test.py | Adds serializer test for saving multi-select caption/transcript resources and persisting {file, language} lists. |
| websites/models.py | Updates full_metadata to rewrite URLs for both scalar and list caption/transcript formats. |
| websites/models_test.py | Adds test coverage for full_metadata URL rewriting on list-based caption/transcript formats. |
| websites/migrations/0074_remove_video_file_path_fields.py | Introduces migration to convert scalar relation content IDs to list (multi-select) format. |
| websites/config_schema/site-config-schema.yml | Expands config schema relation_filter.filter_type to include not_equals. |
| videos/utils.py | Adds a helper for producing {file, language} entries (currently unused). |
| static/js/resources/ocw-course-site-config.json | Adds new multi-select relation fields for captions/transcript resources, with not_equals filters. |
| localdev/configs/ocw-course-site-config.yml | Mirrors the same captions/transcript relation fields and not_equals filters for localdev. |
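
Several of the files above handle both the legacy scalar string and the new array of {file, language} objects. A minimal sketch of reading either shape into a flat list of paths follows; the function name is hypothetical and the real helpers in websites/utils.py and websites/models.py may be structured differently.

```python
def caption_file_paths(value):
    """Return file paths from either supported metadata format.

    Accepts a legacy scalar string or a list of {"file", "language"} dicts;
    entries without a usable "file" value are skipped.
    """
    if not value:
        return []
    if isinstance(value, str):
        return [value]
    return [
        entry["file"]
        for entry in value
        if isinstance(entry, dict) and entry.get("file")
    ]
```

Normalizing to a list first lets downstream code, such as URL rewriting or reference resolution, use a single loop regardless of which format a given content item still carries.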
What are the relevant tickets?
https://github.com/mitodl/hq/issues/10982
Description (What does it do?)
Converts video_captions_resource and video_transcript_resource from single-select to multi-select relation fields, allowing a video resource to be linked to multiple caption or transcript files (e.g. one per language).
Migration 0074 converts any existing scalar relation value {"content": "text_id"} to the array form {"content": ["text_id"]} for both fields.
How can this be tested?
ocw-hugo-projects branch: umar/10982-migrate-caption-and-transcript-resource-relation
umar/10982-migrate-caption-and-transcript-resource-relation
video_captions_resource and video_transcript_resource are rendered as multi-select relation pickers.