Skip to content

feat: migarate captions & transcript file fields to resource fields#2999

Open
umar8hassan wants to merge 4 commits intomasterfrom
umar/10958-remove-captions-and-transcript-file-fields
Open

feat: migarate captions & transcript file fields to resource fields#2999
umar8hassan wants to merge 4 commits intomasterfrom
umar/10958-remove-captions-and-transcript-file-fields

Conversation

@umar8hassan
Copy link
Copy Markdown
Contributor

@umar8hassan umar8hassan commented Apr 21, 2026

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10958

Description (What does it do?)

Removes the legacy video_captions_file and video_transcript_file string fields from the video_files metadata block on video resources.

How can this be tested?

  1. Switch to main branch
  2. Visit a course site in Studio that has an existing video resource with captions or transcripts already linked.
  • OR if you have working 3play locally, you can add a video resource with youtube_id: E8uZtq_vOYM and it'd pull the related transcript and captions.
  1. Switch to the branch: umar/10958-remove-captions-and-transcript-file-fields
  2. Add content to starter config in admin from cow-hugo-projects branch: umar/10958-remove-captions-and-transcript-file-fields
  3. Open the video resource editor, confirm that video_captions_file and video_transcript_file text fields are no longer present in the form.
  4. Rebuild containers or run docker-compose exec web python manage.py migrate and verify migration 0073 applies cleanly without errors.
  5. After migration, open the same video resource and confirm the linked caption/transcript content still appears correctly under the relation fields.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions video caption and transcript metadata from scalar string fields to relation-based arrays. It introduces data migrations to convert existing records and remove legacy path fields. Review feedback identifies performance issues related to N+1 queries in the migrations, recommending bulk updates and optimized field loading. Additionally, it suggests removing redundant logic for fields that are immediately deleted in a subsequent migration.

Comment thread websites/migrations/0071_convert_video_caption_transcript_to_array.py Outdated
Comment thread websites/migrations/0072_remove_video_file_path_fields.py Outdated
Comment thread websites/migrations/0071_convert_video_caption_transcript_to_array.py Outdated
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch 2 times, most recently from 566d506 to f010e79 Compare May 4, 2026 12:45
@umar8hassan umar8hassan changed the title feat: add data migration to convert scalar caption/transcript values … feat: migarate captions & transcript file fields to resource fields May 4, 2026
@umar8hassan umar8hassan marked this pull request as ready for review May 4, 2026 13:21
@umar8hassan umar8hassan requested a review from Copilot May 4, 2026 13:22
Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates OCW Studio’s video resource metadata handling to fully deprecate the legacy video_captions_file / video_transcript_file string fields and migrate existing data toward the newer relation-based _resource fields, aligning the editor/config schema with the intended data model (ref: https://github.com/mitodl/hq/issues/10958).

Changes:

  • Removes legacy captions/transcript URL fields from the OCW course site config schemas (JSON + localdev YAML) and related metadata test expectations.
  • Adds a data migration (0073) to back-populate video_captions_resource / video_transcript_resource from the existing referenced_by M2M and delete legacy _file keys from metadata.video_files.
  • Adds unit tests covering the migration’s backpopulation and cleanup behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
websites/site_config_api_test.py Updates expected generated metadata to no longer include legacy captions/transcript file fields.
websites/migrations/0073_convert_video_caption_transcript_to_array.py Adds data migration to populate _resource fields from referenced_by and remove legacy _file fields.
websites/migration_0073_test.py Adds tests validating migration behavior for single/multiple refs and cleanup.
static/js/resources/ocw-course-site-config.json Removes legacy captions/transcript URL fields from the “Video Files” object schema.
localdev/configs/ocw-course-site-config.yml Removes legacy captions/transcript URL fields from the localdev OCW course config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated
Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from f010e79 to 0ddbc0b Compare May 4, 2026 17:51
Comment thread videos/management/commands/sync_transcripts.py
Comment thread videos/tasks.py
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from 0ddbc0b to 6168e9f Compare May 4, 2026 18:12
Comment thread videos/management/commands/sync_transcripts.py
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from 6168e9f to d9c5b98 Compare May 5, 2026 03:50
@umar8hassan umar8hassan requested a review from Copilot May 5, 2026 03:50
Comment thread websites/serializers.py
Comment on lines +82 to +87
file_entries = [
{"file": f"/{r.file.name.lstrip('/')}", "language": "en"}
for r in resources
if getattr(r, "file", None) and r.file.name
]
set_dict_field(metadata, target_field, file_entries or None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The full_metadata property in websites/models.py will crash with an AttributeError because it calls .replace() on a caption/transcript field that is now a list, not a string.
Severity: CRITICAL

Suggested Fix

In the full_metadata property within websites/models.py, before calling .replace() on the value for YT_FIELD_TRANSCRIPT or YT_FIELD_CAPTIONS, check if the value is a string. If it is a list, iterate through it and apply the replacement to the file path within each dictionary.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: websites/serializers.py#L82-L87

Potential issue: The `full_metadata` property in `websites/models.py` processes video
caption and transcript file paths. It assumes these paths are strings and calls the
`.replace()` method on them. However, this pull request changes the data structure for
these fields from a string to a list of dictionaries (e.g., `[{"file": "...",
"language": "en"}]`). When the `full_metadata` property is accessed during content
serialization (e.g., publishing or previewing a site), it will attempt to call
`.replace()` on a list, which will raise an `AttributeError: 'list' object has no
attribute 'replace'`, causing a server crash.

Also affects:

  • videos/tasks.py:522~533

Comment thread videos/management/commands/sync_missing_captions.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread websites/serializers.py
Comment thread websites/serializers.py
Comment thread videos/tasks.py Outdated
Comment thread videos/tasks.py
Comment thread videos/management/commands/sync_transcripts.py
Comment thread videos/management/commands/sync_transcripts.py
Comment thread videos/management/commands/sync_transcripts.py
Comment thread websites/migrations/0073_convert_video_caption_transcript_to_array.py Outdated
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from d9c5b98 to f73ee3b Compare May 5, 2026 06:04
Comment thread videos/management/commands/sync_transcripts.py
After migration 0073, video_captions_file and video_transcript_file are no
longer scalar string paths. delete_related_captions_and_transcript was reading
those removed fields, always getting None, and skipping deletion entirely.

Now resolves related resources via two strategies:
1. _resource relation fields (video_captions_resource / video_transcript_resource)
   — primary post-migration path, looks up by website + text_id
2. _file path fields fallback — handles both new array format and legacy scalar
   strings for any pre-migration data

Also adds test_delete_pre_existing_video_legacy_file_fields to cover the scalar
fallback path, and updates existing tests to use the _resource field format.
@umar8hassan umar8hassan force-pushed the umar/10958-remove-captions-and-transcript-file-fields branch from f73ee3b to 45d1320 Compare May 6, 2026 03:10
Comment thread websites/serializers.py
Comment on lines +80 to +85
if source_captions.file and source_captions.file.name:
file_path = f"/{source_captions.file.name.lstrip('/')}"
video.metadata["video_files"]["video_captions_file"] = [
{"file": file_path, "language": "en"}
]
video.save()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The sync_transcripts.py command creates a resource link without populating the video_captions_file array if the source resource has no file, causing captions to be missing from the build.
Severity: HIGH

Suggested Fix

Modify the conditional logic in sync_transcripts.py. When a source caption/transcript resource exists but has no file, either avoid creating the resource link entirely, or ensure the video_captions_file field is explicitly set to an empty list or None to prevent using stale data and reflect the absence of a file.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: videos/management/commands/sync_transcripts.py#L80-L85

Potential issue: In `videos/management/commands/sync_transcripts.py`, when syncing a
caption or transcript resource that exists but lacks an associated file, the code
updates the resource link (`video_captions_resource`) but fails to populate the
corresponding file data array (`video_captions_file`). This is due to a conditional
check for `source_captions.file`. This creates an inconsistent state where the resource
appears linked in the CMS, but the build pipeline, which relies on the file data array,
cannot include the caption/transcript file. As a result, the captions or transcripts
will be missing from the published website.

Also affects:

  • videos/management/commands/sync_transcripts.py:103~108

Comment thread videos/tasks.py
Comment on lines +599 to +609
new_ids = []
file_entries = []
for text_id in content_ids:
source_content = WebsiteContent.objects.filter(
website=source_course, text_id=text_id
).first()
if not source_content:
continue
new_content = create_new_content(source_content, destination_course)
new_ids.append(str(new_content.text_id))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The update_transcripts_for_video task writes a scalar URL string to caption/transcript file fields, which is an incorrect format. This leads to data loss when the video is next saved.
Severity: HIGH

Suggested Fix

Update the update_transcripts_for_video task to write data in the correct list-of-dictionaries format that the new schema expects, for example: [{"file": file_path, "language": "en"}]. This will ensure data consistency with the rest of the system and prevent data loss during subsequent saves.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: videos/tasks.py#L596-L609

Potential issue: The `update_transcripts_for_video` task in `videos/tasks.py`
incorrectly writes a scalar URL string to the `video_captions_file` and
`video_transcript_file` metadata fields. The system now expects these fields to be a
list of dictionary objects (e.g., `[{"file": "..."}]`). When the video's metadata is
subsequently updated via the serializer, the `sync_video_relation_urls` function is
triggered. This function overwrites the incorrect scalar string with `None` if no
corresponding `_resource` field is set, leading to data loss and causing the captions or
transcripts to be missing from the website build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants