feat: migarate captions & transcript file fields to resource fields by umar8hassan · Pull Request #2999 · mitodl/ocw-studio

umar8hassan · 2026-04-21T13:15:08Z

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10958

Description (What does it do?)

Removes the legacy video_captions_file and video_transcript_file string fields from the video_files metadata block on video resources.

How can this be tested?

Switch to main branch
Visit a course site in Studio that has an existing video resource with captions or transcripts already linked.

OR if you have working 3play locally, you can add a video resource with youtube_id: E8uZtq_vOYM and it'd pull the related transcript and captions.

Switch to the branch: umar/10958-remove-captions-and-transcript-file-fields
Add content to starter config in admin from cow-hugo-projects branch: umar/10958-remove-captions-and-transcript-file-fields
Open the video resource editor, confirm that video_captions_file and video_transcript_file text fields are no longer present in the form.
Rebuild containers or run docker-compose exec web python manage.py migrate and verify migration 0073 applies cleanly without errors.
After migration, open the same video resource and confirm the linked caption/transcript content still appears correctly under the relation fields.

gemini-code-assist

Code Review

This pull request transitions video caption and transcript metadata from scalar string fields to relation-based arrays. It introduces data migrations to convert existing records and remove legacy path fields. Review feedback identifies performance issues related to N+1 queries in the migrations, recommending bulk updates and optimized field loading. Additionally, it suggests removing redundant logic for fields that are immediately deleted in a subsequent migration.

Copilot

Pull request overview

This PR updates OCW Studio’s video resource metadata handling to fully deprecate the legacy video_captions_file / video_transcript_file string fields and migrate existing data toward the newer relation-based _resource fields, aligning the editor/config schema with the intended data model (ref: https://github.com/mitodl/hq/issues/10958).

Changes:

Removes legacy captions/transcript URL fields from the OCW course site config schemas (JSON + localdev YAML) and related metadata test expectations.
Adds a data migration (0073) to back-populate video_captions_resource / video_transcript_resource from the existing referenced_by M2M and delete legacy _file keys from metadata.video_files.
Adds unit tests covering the migration’s backpopulation and cleanup behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
websites/site_config_api_test.py	Updates expected generated metadata to no longer include legacy captions/transcript file fields.
websites/migrations/0073_convert_video_caption_transcript_to_array.py	Adds data migration to populate `_resource` fields from `referenced_by` and remove legacy `_file` fields.
websites/migration_0073_test.py	Adds tests validating migration behavior for single/multiple refs and cleanup.
static/js/resources/ocw-course-site-config.json	Removes legacy captions/transcript URL fields from the “Video Files” object schema.
localdev/configs/ocw-course-site-config.yml	Removes legacy captions/transcript URL fields from the localdev OCW course config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sentry · 2026-05-05T03:53:37Z

+        file_entries = [
+            {"file": f"/{r.file.name.lstrip('/')}", "language": "en"}
+            for r in resources
+            if getattr(r, "file", None) and r.file.name
+        ]
+        set_dict_field(metadata, target_field, file_entries or None)


Bug: The full_metadata property in websites/models.py will crash with an AttributeError because it calls .replace() on a caption/transcript field that is now a list, not a string.
_{Severity: CRITICAL}

Suggested Fix

In the full_metadata property within websites/models.py, before calling .replace() on the value for YT_FIELD_TRANSCRIPT or YT_FIELD_CAPTIONS, check if the value is a string. If it is a list, iterate through it and apply the replacement to the file path within each dictionary.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: websites/serializers.py#L82-L87 Potential issue: The `full_metadata` property in `websites/models.py` processes video caption and transcript file paths. It assumes these paths are strings and calls the `.replace()` method on them. However, this pull request changes the data structure for these fields from a string to a list of dictionaries (e.g., `[{"file": "...", "language": "en"}]`). When the `full_metadata` property is accessed during content serialization (e.g., publishing or previewing a site), it will attempt to call `.replace()` on a list, which will raise an `AttributeError: 'list' object has no attribute 'replace'`, causing a server crash.

Also affects:

videos/tasks.py:522~533

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

After migration 0073, video_captions_file and video_transcript_file are no longer scalar string paths. delete_related_captions_and_transcript was reading those removed fields, always getting None, and skipping deletion entirely. Now resolves related resources via two strategies: 1. _resource relation fields (video_captions_resource / video_transcript_resource) — primary post-migration path, looks up by website + text_id 2. _file path fields fallback — handles both new array format and legacy scalar strings for any pre-migration data Also adds test_delete_pre_existing_video_legacy_file_fields to cover the scalar fallback path, and updates existing tests to use the _resource field format.

sentry · 2026-05-06T03:55:41Z

+                        if source_captions.file and source_captions.file.name:
+                            file_path = f"/{source_captions.file.name.lstrip('/')}"
+                            video.metadata["video_files"]["video_captions_file"] = [
+                                {"file": file_path, "language": "en"}
+                            ]
+                        video.save()


Bug: The sync_transcripts.py command creates a resource link without populating the video_captions_file array if the source resource has no file, causing captions to be missing from the build.
_{Severity: HIGH}

Suggested Fix

Modify the conditional logic in sync_transcripts.py. When a source caption/transcript resource exists but has no file, either avoid creating the resource link entirely, or ensure the video_captions_file field is explicitly set to an empty list or None to prevent using stale data and reflect the absence of a file.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: videos/management/commands/sync_transcripts.py#L80-L85 Potential issue: In `videos/management/commands/sync_transcripts.py`, when syncing a caption or transcript resource that exists but lacks an associated file, the code updates the resource link (`video_captions_resource`) but fails to populate the corresponding file data array (`video_captions_file`). This is due to a conditional check for `source_captions.file`. This creates an inconsistent state where the resource appears linked in the CMS, but the build pipeline, which relies on the file data array, cannot include the caption/transcript file. As a result, the captions or transcripts will be missing from the published website.

Also affects:

videos/management/commands/sync_transcripts.py:103~108

sentry · 2026-05-06T03:55:41Z

+        new_ids = []
+        file_entries = []
+        for text_id in content_ids:
+            source_content = WebsiteContent.objects.filter(
+                website=source_course, text_id=text_id
+            ).first()
+            if not source_content:
+                continue
+            new_content = create_new_content(source_content, destination_course)
+            new_ids.append(str(new_content.text_id))
+


Bug: The update_transcripts_for_video task writes a scalar URL string to caption/transcript file fields, which is an incorrect format. This leads to data loss when the video is next saved.
_{Severity: HIGH}

Suggested Fix

Update the update_transcripts_for_video task to write data in the correct list-of-dictionaries format that the new schema expects, for example: [{"file": file_path, "language": "en"}]. This will ensure data consistency with the rest of the system and prevent data loss during subsequent saves.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: videos/tasks.py#L596-L609 Potential issue: The `update_transcripts_for_video` task in `videos/tasks.py` incorrectly writes a scalar URL string to the `video_captions_file` and `video_transcript_file` metadata fields. The system now expects these fields to be a list of dictionary objects (e.g., `[{"file": "..."}]`). When the video's metadata is subsequently updated via the serializer, the `sync_video_relation_urls` function is triggered. This function overwrites the incorrect scalar string with `None` if no corresponding `_resource` field is set, leading to data loss and causing the captions or transcripts to be missing from the website build.

gemini-code-assist Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread websites/migrations/0071_convert_video_caption_transcript_to_array.py Outdated

Comment thread websites/migrations/0072_remove_video_file_path_fields.py Outdated

Comment thread websites/migrations/0071_convert_video_caption_transcript_to_array.py Outdated

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch 2 times, most recently from 566d506 to f010e79 Compare May 4, 2026 12:45

umar8hassan changed the title ~~feat: add data migration to convert scalar caption/transcript values …~~ feat: migarate captions & transcript file fields to resource fields May 4, 2026

umar8hassan marked this pull request as ready for review May 4, 2026 13:21

umar8hassan requested a review from Copilot May 4, 2026 13:22

Copilot started reviewing on behalf of umar8hassan May 4, 2026 13:22 View session

sentry Bot reviewed May 4, 2026

View reviewed changes

Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated

Copilot AI reviewed May 4, 2026

View reviewed changes

Comment thread websites/migrations/0073_video_caption_transcript_files_to_resources.py

Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated

Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from f010e79 to 0ddbc0b Compare May 4, 2026 17:51

sentry Bot reviewed May 4, 2026

View reviewed changes

Comment thread videos/management/commands/sync_transcripts.py

Comment thread videos/tasks.py

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from 0ddbc0b to 6168e9f Compare May 4, 2026 18:12

sentry Bot reviewed May 4, 2026

View reviewed changes

Comment thread videos/management/commands/sync_transcripts.py

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from 6168e9f to d9c5b98 Compare May 5, 2026 03:50

umar8hassan requested a review from Copilot May 5, 2026 03:50

Copilot started reviewing on behalf of umar8hassan May 5, 2026 03:51 View session

sentry Bot reviewed May 5, 2026

View reviewed changes

Copilot AI reviewed May 5, 2026

View reviewed changes

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from d9c5b98 to f73ee3b Compare May 5, 2026 06:04

sentry Bot reviewed May 5, 2026

View reviewed changes

Comment thread videos/management/commands/sync_transcripts.py

umar8hassan added 2 commits May 6, 2026 07:15

feat: migrate video captions/transcripts to resource relation fields

5b64a69

umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from f73ee3b to 45d1320 Compare May 6, 2026 03:10

sentry Bot reviewed May 6, 2026

View reviewed changes

Comment thread websites/serializers.py

Comment thread websites/migrations/0073_video_caption_transcript_files_to_resources.py

umar8hassan added 2 commits May 6, 2026 08:48

chore: fixed matching query for captions and transcript resources

c125ac3

chore: renamed migration file

44cd87d

sentry Bot reviewed May 6, 2026

View reviewed changes

Conversation

umar8hassan commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What are the relevant tickets?

Description (What does it do?)

How can this be tested?

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sentry Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sentry Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

sentry Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

umar8hassan commented Apr 21, 2026 •

edited

Loading