feat: migrate caption and transcript resource relation by umar8hassan · Pull Request #3031 · mitodl/ocw-studio

umar8hassan · 2026-05-04T06:01:26Z

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10982

Description (What does it do?)

Converts video_captions_resource and video_transcript_resource from single-select to multi-select relation fields, so that a video resource can be linked to multiple caption or transcript files (e.g. one per language).

Migration 0074 converts any existing scalar relation value {"content": "text_id"} to the array form {"content": ["text_id"]} for both fields.
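
As a rough illustration of the conversion described above, the sketch below wraps a scalar `content` value in a list and leaves already-migrated values alone. It is not the code from migration 0074: the helper name `to_multi_select`, the assumption that both fields sit at the top level of the metadata dict, and the choice to map an empty scalar to an empty list are all hypothetical.

```python
from copy import deepcopy

# Field names are taken from the PR description.
RELATION_FIELDS = ("video_captions_resource", "video_transcript_resource")


def to_multi_select(metadata):
    """Return a copy of `metadata` with scalar relation content wrapped in a list.

    Hypothetical helper: assumes the relation fields are top-level keys.
    """
    updated = deepcopy(metadata)
    for field in RELATION_FIELDS:
        relation = updated.get(field)
        if not isinstance(relation, dict) or "content" not in relation:
            continue  # field absent or not a relation value
        content = relation["content"]
        if isinstance(content, list):
            continue  # already in multi-select form, so the step is idempotent
        # Assumption: an empty or missing scalar becomes an empty list.
        relation["content"] = [content] if content else []
    return updated


before = {"video_captions_resource": {"content": "text_id"}}
after = to_multi_select(before)
```

Skipping values that are already lists means the conversion can be run more than once without changing the result.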

How can this be tested?

Add content to starter config in admin from ocw-hugo-projects branch: umar/10982-migrate-caption-and-transcript-resource-relation
Switch to this branch: umar/10982-migrate-caption-and-transcript-resource-relation
Rebuild containers or run docker-compose exec web python manage.py migrate and verify migration 0074 applies cleanly without errors.
Visit a video resource in Studio, and confirm video_captions_resource and video_transcript_resource are rendered as multi-select relation pickers.
Add two caption resources to the same video and save, and verify both appear in the field and are persisted correctly.

gemini-code-assist

Code Review

This pull request transitions video captions and transcripts from simple file path strings to multi-select relation fields. Key changes include updating the site configuration schema, adding utility functions to process multi-language file paths, and implementing data migrations to back-populate resource fields from existing relationships. Review feedback suggests scoping resource queries by website and preserving selection order in serializers, optimizing migration performance by avoiding N+1 queries, and manually updating website republish flags since historical model save methods do not trigger standard side effects.

gemini-code-assist · 2026-05-04T06:04:11Z

+        if isinstance(relation_id, list):
+            # Multi-select format — content is an array of text_ids.
+            resources = list(WebsiteContent.objects.filter(text_id__in=relation_id))
+            set_dict_field(metadata, target_field, resource_file_paths(resources))
+        else:
+            # Scalar format — content is a single text_id string.
+            resource = (
+                WebsiteContent.objects.filter(text_id=relation_id).first()
+                if relation_id
+                else None
+            )
+            set_dict_field(metadata, target_field, resource_file_path(resource))


The WebsiteContent query should be scoped to the specific website to ensure correctness, as text_id is only unique within a website. Additionally, the order of resources should be preserved to match the order in relation_id, which is important for multi-select fields.

Note that relation_value typically contains a website key (the website name or URL path) which can be used for filtering.

        if isinstance(relation_id, list):
            # Multi-select format — content is an array of text_ids.
            website_identifier = relation_value.get("website")
            resources = WebsiteContent.objects.filter(
                Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
                text_id__in=relation_id
            )
            # Preserve the order of resources as specified in relation_id
            resource_map = {r.text_id: r for r in resources}
            ordered_resources = [resource_map[tid] for tid in relation_id if tid in resource_map]
            set_dict_field(metadata, target_field, resource_file_paths(ordered_resources))
        else:
            # Scalar format — content is a single text_id string.
            website_identifier = relation_value.get("website")
            resource = (
                WebsiteContent.objects.filter(
                    Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
                    text_id=relation_id
                ).first()
                if relation_id
                else None
            )
            set_dict_field(metadata, target_field, resource_file_path(resource))

gemini-code-assist · 2026-05-04T06:04:11Z

+        captions_ids = list(
+            content.referenced_by.filter(filename__contains="_captions_").values_list(
+                "text_id", flat=True
+            )
+        )
+        transcript_ids = list(
+            content.referenced_by.filter(
+                filename__contains="_transcript_"
+            ).values_list("text_id", flat=True)
+        )


This migration performs two additional queries (referenced_by.filter(...)) for every video content record. For a large number of records, this N+1 pattern can be very slow. Consider prefetching the referenced_by relationship to process the data in memory.
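
One way to picture the in-memory alternative: fetch the referencing rows once (for example through a prefetch of referenced_by, or a single values_list query), then bucket them per video in Python. The sketch below shows only the bucketing step on plain tuples. The function name `bucket_references` and the `(video_id, text_id, filename)` row shape are assumptions for illustration, not code from this PR.

```python
from collections import defaultdict


def bucket_references(rows):
    """Group (video_id, text_id, filename) rows per video in a single pass.

    Hypothetical helper: mirrors the two filename__contains filters from
    the migration with plain substring checks.
    """
    captions = defaultdict(list)
    transcripts = defaultdict(list)
    for video_id, text_id, filename in rows:
        if "_captions_" in filename:
            captions[video_id].append(text_id)
        if "_transcript_" in filename:
            transcripts[video_id].append(text_id)
    return captions, transcripts


rows = [
    (1, "c1", "lecture_captions_en.vtt"),
    (1, "t1", "lecture_transcript_en.pdf"),
    (1, "c2", "lecture_captions_es.vtt"),
    (2, "x1", "slides.pdf"),
]
captions, transcripts = bucket_references(rows)
```

With the rows loaded up front, the per-video loop becomes two dictionary lookups instead of two queries.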

gemini-code-assist · 2026-05-04T06:04:11Z

+                changed = True
+
+        if changed:
+            content.save(update_fields=["metadata"])


According to the organization's general rules, custom model methods like save() are not available on historical models during migrations. Therefore, side effects like updating the website's republish flags (has_unpublished_live, has_unpublished_draft) must be performed manually. You should collect the IDs of updated websites and perform a bulk update at the end of the migration.

References

When performing a data migration in Django, custom model methods like save() are not available on historical models. If side effects like triggering a website republish are needed, they must be performed manually, for example, by directly updating flags on related models.
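
A minimal sketch of the bulk update the comment suggests, assuming the migration collects the primary keys of affected websites while it loops over content. The helper name `flag_websites_for_republish` is hypothetical; the two flag names come from the review comment. Inside a migration, the model would typically be the historical one obtained via `apps.get_model("websites", "Website")`.

```python
def flag_websites_for_republish(website_model, website_ids):
    """Set both republish flags on the given websites with one UPDATE.

    Hypothetical helper: `website_model` is expected to be a Django model
    class (historical or real) exposing the usual `objects` manager.
    """
    if not website_ids:
        return 0  # nothing changed, so skip the query entirely
    # QuerySet.update() issues a single UPDATE and returns the number of matched rows.
    return website_model.objects.filter(pk__in=website_ids).update(
        has_unpublished_live=True,
        has_unpublished_draft=True,
    )
```

Because update() bypasses save() altogether, this works the same way on historical models as on the real model class.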

gemini-code-assist · 2026-05-04T06:04:11Z

+                    changed = True
+
+        if changed:
+            content.save(update_fields=["metadata"])


Similar to migration 0073, the republish flags on the Website model should be updated manually since the historical WebsiteContent.save() method won't trigger the side effects defined in the actual model class.

- Add migration 0074: convert scalar _resource content to list format
- Update sync_video_relation_urls to handle both scalar and list content
- Update full_metadata to handle array-of-{file,language} _file values
- Update resolve_video_file_referenced_content_ids for array _file format
- Add video_captions_resource and video_transcript_resource relation fields (multiple: true) to site configs
- Add filter_type: not_equals to site-config schema
- Add resource_file_paths helper to videos/utils.py (language detection placeholder)
- Add tests: migration_0074, models, utils, serializers, site_config_api

Copilot

Pull request overview

This PR updates how video caption/transcript metadata is represented and migrated, moving caption/transcript resource relations to multi-select while preserving compatibility for legacy caption/transcript file path metadata formats.

Changes:

Update caption/transcript URL-path handling to support both legacy scalar strings and the new multi-language array-of-objects format.
Add tests covering multi-language caption/transcript path resolution and full_metadata URL rewriting.
Add a data migration to convert video_*_resource.content from scalar to list format, and update site configs/schema to support new relation widget behavior.
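
To make the two metadata shapes concrete, the sketch below normalizes either a legacy scalar path or a list of `{file, language}` objects into a flat list of paths. The function name `caption_file_paths` and the sample paths are hypothetical; only the `file` and `language` keys are taken from the review summary.

```python
def caption_file_paths(value):
    """Return file paths from a legacy string or a list of {file, language} dicts.

    Hypothetical helper: entries without a usable `file` key are skipped.
    """
    if not value:
        return []  # None, "" and [] all mean "no caption/transcript file"
    if isinstance(value, str):
        return [value]  # legacy scalar format
    return [
        entry["file"]
        for entry in value
        if isinstance(entry, dict) and entry.get("file")
    ]


legacy = caption_file_paths("/courses/example/captions.vtt")
multi = caption_file_paths(
    [
        {"file": "/courses/example/captions_en.vtt", "language": "en"},
        {"file": "/courses/example/captions_es.vtt", "language": "es"},
    ]
)
```

Code that rewrites URLs or resolves referenced content can then iterate over one list regardless of which format a given record still uses.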

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
websites/utils.py	Adds support for resolving referenced content IDs from both scalar and array caption/transcript file metadata formats.
websites/utils_test.py	Adds test coverage for the array-of-objects caption/transcript file format.
websites/site_config_api_test.py	Updates expected generated metadata to include the new captions/transcript resource fields.
websites/serializers_test.py	Adds serializer test for saving multi-select caption/transcript resources and persisting `{file, language}` lists.
websites/models.py	Updates `full_metadata` to rewrite URLs for both scalar and list caption/transcript formats.
websites/models_test.py	Adds test coverage for `full_metadata` URL rewriting on list-based caption/transcript formats.
websites/migrations/0074_remove_video_file_path_fields.py	Introduces migration to convert scalar relation content IDs to list (multi-select) format.
websites/config_schema/site-config-schema.yml	Expands config schema `relation_filter.filter_type` to include `not_equals`.
videos/utils.py	Adds a helper for producing `{file, language}` entries (currently unused).
static/js/resources/ocw-course-site-config.json	Adds new multi-select relation fields for captions/transcript resources, with `not_equals` filters.
localdev/configs/ocw-course-site-config.yml	Mirrors the same captions/transcript relation fields and `not_equals` filters for localdev.

gemini-code-assist Bot reviewed May 4, 2026

umar8hassan force-pushed the umar/10982-migrate-caption-and-transcript-resource-relation branch from 5267047 to 412b552 Compare May 4, 2026 13:19

umar8hassan changed the base branch from master to umar/10958-remove-captions-and-transcript-file-fields May 6, 2026 03:56

umar8hassan force-pushed the umar/10982-migrate-caption-and-transcript-resource-relation branch from 412b552 to 8927014 Compare May 6, 2026 03:56

umar8hassan marked this pull request as ready for review May 6, 2026 03:57

umar8hassan requested a review from Copilot May 6, 2026 03:57

Copilot started reviewing on behalf of umar8hassan May 6, 2026 03:57

Copilot AI reviewed May 6, 2026

chore: fixed migration dependency

67fa757

sentry Bot reviewed May 6, 2026

Comment thread websites/config_schema/site-config-schema.yml

chore: fixed constraints and removed redundant ref

c8df8a7

feat: migrate caption and transcript resource relation #3031
umar8hassan wants to merge 3 commits into umar/10958-remove-captions-and-transcript-file-fields from umar/10982-migrate-caption-and-transcript-resource-relation
