feat: migarate captions & transcript file fields to resource fields#2999
feat: migarate captions & transcript file fields to resource fields#2999umar8hassan wants to merge 4 commits intomasterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request transitions video caption and transcript metadata from scalar string fields to relation-based arrays. It introduces data migrations to convert existing records and remove legacy path fields. Review feedback identifies performance issues related to N+1 queries in the migrations, recommending bulk updates and optimized field loading. Additionally, it suggests removing redundant logic for fields that are immediately deleted in a subsequent migration.
566d506 to
f010e79
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates OCW Studio’s video resource metadata handling to fully deprecate the legacy video_captions_file / video_transcript_file string fields and migrate existing data toward the newer relation-based _resource fields, aligning the editor/config schema with the intended data model (ref: https://github.com/mitodl/hq/issues/10958).
Changes:
- Removes legacy captions/transcript URL fields from the OCW course site config schemas (JSON + localdev YAML) and related metadata test expectations.
- Adds a data migration (0073) to back-populate
video_captions_resource/video_transcript_resourcefrom the existingreferenced_byM2M and delete legacy_filekeys frommetadata.video_files. - Adds unit tests covering the migration’s backpopulation and cleanup behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| websites/site_config_api_test.py | Updates expected generated metadata to no longer include legacy captions/transcript file fields. |
| websites/migrations/0073_convert_video_caption_transcript_to_array.py | Adds data migration to populate _resource fields from referenced_by and remove legacy _file fields. |
| websites/migration_0073_test.py | Adds tests validating migration behavior for single/multiple refs and cleanup. |
| static/js/resources/ocw-course-site-config.json | Removes legacy captions/transcript URL fields from the “Video Files” object schema. |
| localdev/configs/ocw-course-site-config.yml | Removes legacy captions/transcript URL fields from the localdev OCW course config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f010e79 to
0ddbc0b
Compare
0ddbc0b to
6168e9f
Compare
6168e9f to
d9c5b98
Compare
| file_entries = [ | ||
| {"file": f"/{r.file.name.lstrip('/')}", "language": "en"} | ||
| for r in resources | ||
| if getattr(r, "file", None) and r.file.name | ||
| ] | ||
| set_dict_field(metadata, target_field, file_entries or None) |
There was a problem hiding this comment.
Bug: The full_metadata property in websites/models.py will crash with an AttributeError because it calls .replace() on a caption/transcript field that is now a list, not a string.
Severity: CRITICAL
Suggested Fix
In the full_metadata property within websites/models.py, before calling .replace() on the value for YT_FIELD_TRANSCRIPT or YT_FIELD_CAPTIONS, check if the value is a string. If it is a list, iterate through it and apply the replacement to the file path within each dictionary.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: websites/serializers.py#L82-L87
Potential issue: The `full_metadata` property in `websites/models.py` processes video
caption and transcript file paths. It assumes these paths are strings and calls the
`.replace()` method on them. However, this pull request changes the data structure for
these fields from a string to a list of dictionaries (e.g., `[{"file": "...",
"language": "en"}]`). When the `full_metadata` property is accessed during content
serialization (e.g., publishing or previewing a site), it will attempt to call
`.replace()` on a list, which will raise an `AttributeError: 'list' object has no
attribute 'replace'`, causing a server crash.
Also affects:
videos/tasks.py:522~533
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d9c5b98 to
f73ee3b
Compare
After migration 0073, video_captions_file and video_transcript_file are no longer scalar string paths. delete_related_captions_and_transcript was reading those removed fields, always getting None, and skipping deletion entirely. Now resolves related resources via two strategies: 1. _resource relation fields (video_captions_resource / video_transcript_resource) — primary post-migration path, looks up by website + text_id 2. _file path fields fallback — handles both new array format and legacy scalar strings for any pre-migration data Also adds test_delete_pre_existing_video_legacy_file_fields to cover the scalar fallback path, and updates existing tests to use the _resource field format.
f73ee3b to
45d1320
Compare
| if source_captions.file and source_captions.file.name: | ||
| file_path = f"/{source_captions.file.name.lstrip('/')}" | ||
| video.metadata["video_files"]["video_captions_file"] = [ | ||
| {"file": file_path, "language": "en"} | ||
| ] | ||
| video.save() |
There was a problem hiding this comment.
Bug: The sync_transcripts.py command creates a resource link without populating the video_captions_file array if the source resource has no file, causing captions to be missing from the build.
Severity: HIGH
Suggested Fix
Modify the conditional logic in sync_transcripts.py. When a source caption/transcript resource exists but has no file, either avoid creating the resource link entirely, or ensure the video_captions_file field is explicitly set to an empty list or None to prevent using stale data and reflect the absence of a file.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: videos/management/commands/sync_transcripts.py#L80-L85
Potential issue: In `videos/management/commands/sync_transcripts.py`, when syncing a
caption or transcript resource that exists but lacks an associated file, the code
updates the resource link (`video_captions_resource`) but fails to populate the
corresponding file data array (`video_captions_file`). This is due to a conditional
check for `source_captions.file`. This creates an inconsistent state where the resource
appears linked in the CMS, but the build pipeline, which relies on the file data array,
cannot include the caption/transcript file. As a result, the captions or transcripts
will be missing from the published website.
Also affects:
videos/management/commands/sync_transcripts.py:103~108
| new_ids = [] | ||
| file_entries = [] | ||
| for text_id in content_ids: | ||
| source_content = WebsiteContent.objects.filter( | ||
| website=source_course, text_id=text_id | ||
| ).first() | ||
| if not source_content: | ||
| continue | ||
| new_content = create_new_content(source_content, destination_course) | ||
| new_ids.append(str(new_content.text_id)) | ||
|
|
There was a problem hiding this comment.
Bug: The update_transcripts_for_video task writes a scalar URL string to caption/transcript file fields, which is an incorrect format. This leads to data loss when the video is next saved.
Severity: HIGH
Suggested Fix
Update the update_transcripts_for_video task to write data in the correct list-of-dictionaries format that the new schema expects, for example: [{"file": file_path, "language": "en"}]. This will ensure data consistency with the rest of the system and prevent data loss during subsequent saves.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: videos/tasks.py#L596-L609
Potential issue: The `update_transcripts_for_video` task in `videos/tasks.py`
incorrectly writes a scalar URL string to the `video_captions_file` and
`video_transcript_file` metadata fields. The system now expects these fields to be a
list of dictionary objects (e.g., `[{"file": "..."}]`). When the video's metadata is
subsequently updated via the serializer, the `sync_video_relation_urls` function is
triggered. This function overwrites the incorrect scalar string with `None` if no
corresponding `_resource` field is set, leading to data loss and causing the captions or
transcripts to be missing from the website build.
What are the relevant tickets?
https://github.com/mitodl/hq/issues/10958
Description (What does it do?)
Removes the legacy
video_captions_fileandvideo_transcript_filestring fields from thevideo_filesmetadata block on video resources.How can this be tested?
mainbranchyoutube_id: E8uZtq_vOYMand it'd pull the related transcript and captions.umar/10958-remove-captions-and-transcript-file-fieldscow-hugo-projectsbranch:umar/10958-remove-captions-and-transcript-file-fieldsvideo_captions_fileandvideo_transcript_filetext fields are no longer present in the form.docker-compose exec web python manage.py migrateand verify migration0073applies cleanly without errors.