feat: migrate caption and transcript resource relation #3031

Open
umar8hassan wants to merge 3 commits into umar/10958-remove-captions-and-transcript-file-fields from umar/10982-migrate-caption-and-transcript-resource-relation

@umar8hassan commented May 4, 2026

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10982

Description (What does it do?)

Converts video_captions_resource and video_transcript_resource from single-select to multi-select relation fields, so that a video resource can be linked to multiple caption or transcript files (e.g. one per language).

Migration 0074 converts any existing scalar relation value {"content": "text_id"} to the array form {"content": ["text_id"]} for both fields.
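
The conversion described for migration 0074 can be sketched as a pure function over one content record's metadata dict. This is a minimal sketch, not the migration itself: the function name `scalar_relation_to_list` is hypothetical, and it assumes the two fields sit at the top level of the metadata and that only non-empty strings need converting. Only the field names and the `{"content": ...}` shapes come from the description above.

```python
from copy import deepcopy

RELATION_FIELDS = ("video_captions_resource", "video_transcript_resource")


def scalar_relation_to_list(metadata):
    """Return (new_metadata, changed) with scalar relation content wrapped in a list."""
    updated = deepcopy(metadata)  # never mutate the caller's dict
    changed = False
    for field in RELATION_FIELDS:
        relation = updated.get(field)
        if not isinstance(relation, dict):
            continue  # field missing or not in relation form
        content = relation.get("content")
        # Assumption: only non-empty strings are converted; lists (already
        # migrated) and empty values are left as they are.
        if isinstance(content, str) and content:
            relation["content"] = [content]
            changed = True
    return updated, changed
```

Because list values are left alone, running the function a second time reports no change, which is the property a re-runnable data migration would want.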

How can this be tested?

  1. In the admin, update the starter config with the content from the ocw-hugo-projects branch: umar/10982-migrate-caption-and-transcript-resource-relation
  2. Switch to this branch: umar/10982-migrate-caption-and-transcript-resource-relation
  3. Rebuild containers or run docker-compose exec web python manage.py migrate and verify migration 0074 applies cleanly without errors.
  4. Visit a video resource in Studio, and confirm video_captions_resource and video_transcript_resource are rendered as multi-select relation pickers.
  5. Add two caption resources to the same video and save, and verify both appear in the field and are persisted correctly.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request transitions video captions and transcripts from simple file path strings to multi-select relation fields. Key changes include updating the site configuration schema, adding utility functions to process multi-language file paths, and implementing data migrations to back-populate resource fields from existing relationships. Review feedback suggests scoping resource queries by website and preserving selection order in serializers, optimizing migration performance by avoiding N+1 queries, and manually updating website republish flags since historical model save methods do not trigger standard side effects.

Comment thread websites/serializers.py Outdated
Comment on lines +85 to +96
        if isinstance(relation_id, list):
            # Multi-select format — content is an array of text_ids.
            resources = list(WebsiteContent.objects.filter(text_id__in=relation_id))
            set_dict_field(metadata, target_field, resource_file_paths(resources))
        else:
            # Scalar format — content is a single text_id string.
            resource = (
                WebsiteContent.objects.filter(text_id=relation_id).first()
                if relation_id
                else None
            )
            set_dict_field(metadata, target_field, resource_file_path(resource))

high

The WebsiteContent query should be scoped to the specific website to ensure correctness, as text_id is only unique within a website. Additionally, the order of resources should be preserved to match the order in relation_id, which is important for multi-select fields.

Note that relation_value typically contains a website key (the website name or URL path) which can be used for filtering.

        if isinstance(relation_id, list):
            # Multi-select format — content is an array of text_ids.
            website_identifier = relation_value.get("website")
            resources = WebsiteContent.objects.filter(
                Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
                text_id__in=relation_id
            )
            # Preserve the order of resources as specified in relation_id
            resource_map = {r.text_id: r for r in resources}
            ordered_resources = [resource_map[tid] for tid in relation_id if tid in resource_map]
            set_dict_field(metadata, target_field, resource_file_paths(ordered_resources))
        else:
            # Scalar format — content is a single text_id string.
            website_identifier = relation_value.get("website")
            resource = (
                WebsiteContent.objects.filter(
                    Q(website__name=website_identifier) | Q(website__url_path=website_identifier),
                    text_id=relation_id
                ).first()
                if relation_id
                else None
            )
            set_dict_field(metadata, target_field, resource_file_path(resource))

Comment on lines +45 to +54
        captions_ids = list(
            content.referenced_by.filter(filename__contains="_captions_").values_list(
                "text_id", flat=True
            )
        )
        transcript_ids = list(
            content.referenced_by.filter(
                filename__contains="_transcript_"
            ).values_list("text_id", flat=True)
        )

medium

This migration performs two additional queries (referenced_by.filter(...)) for every video content record. For a large number of records, this N+1 pattern can be very slow. Consider prefetching the referenced_by relationship to process the data in memory.
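
One way to read this suggestion: fetch the referencing rows once, then bucket them per video in memory. The sketch below shows only the in-memory step in plain Python. The row shape `(video_id, filename, text_id)` and the function name are hypothetical, and how the rows are fetched (for example with `prefetch_related("referenced_by")`) is left to the migration.

```python
from collections import defaultdict


def group_caption_and_transcript_ids(rows):
    """Bucket referencing rows per video without issuing per-video queries.

    rows: iterable of (video_id, filename, text_id) tuples (hypothetical shape).
    Returns {video_id: {"captions": [...], "transcripts": [...]}}.
    """
    grouped = defaultdict(lambda: {"captions": [], "transcripts": []})
    for video_id, filename, text_id in rows:
        # Same substring checks as the filename__contains filters in the migration.
        if "_captions_" in filename:
            grouped[video_id]["captions"].append(text_id)
        if "_transcript_" in filename:
            grouped[video_id]["transcripts"].append(text_id)
    return dict(grouped)
```

Videos whose referencing rows match neither pattern do not appear in the result, so the caller should treat a missing key as "no captions and no transcripts".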

            changed = True

        if changed:
            content.save(update_fields=["metadata"])

medium

According to the organization's general rules, custom model methods like save() are not available on historical models during migrations. Therefore, side effects like updating the website's republish flags (has_unpublished_live, has_unpublished_draft) must be performed manually. You should collect the IDs of updated websites and perform a bulk update at the end of the migration.

References
  1. When performing a data migration in Django, custom model methods like save() are not available on historical models. If side effects like triggering a website republish are needed, they must be performed manually, for example, by directly updating flags on related models.
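
A minimal sketch of the manual flag update, assuming the historical `Website` model is obtained with `apps.get_model` inside the migration and that both flags are plain boolean fields. The helper name `flag_websites_for_republish` is hypothetical; `filter(...).update(...)` is the standard Django queryset bulk update.

```python
def flag_websites_for_republish(website_model, website_ids):
    """Set the republish flags on every affected website in one UPDATE.

    website_model: the historical Website model, e.g.
        apps.get_model("websites", "Website") inside a RunPython function.
    website_ids: primary keys collected while saving updated content.
    Returns the number of rows reported as updated (0 if there was nothing to do).
    """
    ids = set(website_ids)  # de-duplicate: many content rows share a website
    if not ids:
        return 0  # nothing changed, skip the query entirely
    # Assumption: both flags are boolean fields on Website, as named in the review.
    return website_model.objects.filter(pk__in=ids).update(
        has_unpublished_live=True,
        has_unpublished_draft=True,
    )
```

Inside the migration loop, each `content.website_id` whose metadata changed would be added to a set, and the helper called once after the loop finishes.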

            changed = True

        if changed:
            content.save(update_fields=["metadata"])

medium

Similar to migration 0073, the republish flags on the Website model should be updated manually since the historical WebsiteContent.save() method won't trigger the side effects defined in the actual model class.

References
  1. When performing a data migration in Django, custom model methods like save() are not available on historical models. If side effects like triggering a website republish are needed, they must be performed manually, for example, by directly updating flags on related models.

@umar8hassan force-pushed the umar/10982-migrate-caption-and-transcript-resource-relation branch from 5267047 to 412b552 on May 4, 2026 13:19
@umar8hassan changed the base branch from master to umar/10958-remove-captions-and-transcript-file-fields on May 6, 2026 03:56
- Add migration 0074: convert scalar _resource content to list format
- Update sync_video_relation_urls to handle both scalar and list content
- Update full_metadata to handle array-of-{file,language} _file values
- Update resolve_video_file_referenced_content_ids for array _file format
- Add video_captions_resource and video_transcript_resource relation fields (multiple: true) to site configs
- Add filter_type: not_equals to site-config schema
- Add resource_file_paths helper to videos/utils.py (language detection placeholder)
- Add tests: migration_0074, models, utils, serializers, site_config_api
@umar8hassan force-pushed the umar/10982-migrate-caption-and-transcript-resource-relation branch from 412b552 to 8927014 on May 6, 2026 03:56
@umar8hassan marked this pull request as ready for review on May 6, 2026 03:57
@umar8hassan requested a review from Copilot on May 6, 2026 03:57

Copilot AI left a comment

Pull request overview

This PR updates how video caption/transcript metadata is represented and migrated, moving caption/transcript resource relations to multi-select while preserving compatibility for legacy caption/transcript file path metadata formats.

Changes:

  • Update caption/transcript URL-path handling to support both legacy scalar strings and the new multi-language array-of-objects format.
  • Add tests covering multi-language caption/transcript path resolution and full_metadata URL rewriting.
  • Add a data migration to convert video_*_resource.content from scalar to list format, and update site configs/schema to support new relation widget behavior.
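
Supporting both formats usually comes down to normalizing the value before use. A minimal sketch, assuming the legacy form is a single path string and the new form is a list of `{"file": ..., "language": ...}` objects as named in this PR. The function name `caption_file_paths` is hypothetical, and this is not the code in `websites/utils.py`.

```python
def caption_file_paths(value):
    """Return the file paths held in a caption/transcript `_file` metadata value."""
    if isinstance(value, str):
        # Legacy scalar format: a single path string (empty means no file).
        return [value] if value else []
    if isinstance(value, list):
        # New format: list of {"file": path, "language": code} objects.
        return [
            entry["file"]
            for entry in value
            if isinstance(entry, dict) and entry.get("file")
        ]
    return []  # None or any unexpected type
```

Returning a list in every case lets callers such as URL rewriting or referenced-content lookups iterate without branching on the stored format.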

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| websites/utils.py | Adds support for resolving referenced content IDs from both scalar and array caption/transcript file metadata formats. |
| websites/utils_test.py | Adds test coverage for the array-of-objects caption/transcript file format. |
| websites/site_config_api_test.py | Updates expected generated metadata to include the new captions/transcript resource fields. |
| websites/serializers_test.py | Adds serializer test for saving multi-select caption/transcript resources and persisting {file, language} lists. |
| websites/models.py | Updates full_metadata to rewrite URLs for both scalar and list caption/transcript formats. |
| websites/models_test.py | Adds test coverage for full_metadata URL rewriting on list-based caption/transcript formats. |
| websites/migrations/0074_remove_video_file_path_fields.py | Introduces migration to convert scalar relation content IDs to list (multi-select) format. |
| websites/config_schema/site-config-schema.yml | Expands config schema relation_filter.filter_type to include not_equals. |
| videos/utils.py | Adds a helper for producing {file, language} entries (currently unused). |
| static/js/resources/ocw-course-site-config.json | Adds new multi-select relation fields for captions/transcript resources, with not_equals filters. |
| localdev/configs/ocw-course-site-config.yml | Mirrors the same captions/transcript relation fields and not_equals filters for localdev. |

Comment thread websites/migrations/0074_remove_video_file_path_fields.py Outdated
Comment thread websites/config_schema/site-config-schema.yml
Comment thread static/js/resources/ocw-course-site-config.json
Comment thread localdev/configs/ocw-course-site-config.yml Outdated
Comment thread websites/serializers_test.py Outdated
Comment thread videos/utils.py
Comment thread websites/config_schema/site-config-schema.yml