Skip to content

Conversation

@sakshamarora1
Copy link
Contributor

@sakshamarora1 sakshamarora1 commented Feb 2, 2026

closes: #286

Steps

  1. Update the collection queries for a collection, retreive all the comments for the records in the records found and create a json metadata file.
ipython ./scripts/dump_comments_to_migrate.py

Output file: comments_metadata.json
Another output file generated for the missing users: users_metadata.json

  1. For missing users:
    users_metadata.json file will be read and then this script (with some tweaks) can be run to find out the missing users in the new system.
    https://gitlab.cern.ch/cds-team/production_scripts/-/blob/master/cds-rdm/migration/dump_users.py?ref_type=heads

  2. Create those users using:

cds-migrator-kit comments commenters-run --filepath /comments/users_metadata.json --missing-users-filepath /eos/media/cds/cds-rdm/dev/migration/users/people.csv --dry-run
  1. Place comments_metadata.json in /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/

  2. Finally migrate the comments:

invenio migration comments --filepath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/comments_metadata.json --dirpath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/ --dry-run

@sakshamarora1 sakshamarora1 marked this pull request as ready for review February 4, 2026 16:33
self.all_record_versions = {
str(hit["versions"]["index"]): hit for hit in search_result
}
oldest_version = min(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be faster via record._record.versions[-1]? I mean instead of scan_versions etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That returns an instance of VersionsManager and it doesn't have other versions stored in it. We will have to do scan_versions to find all the versions and select the minimum un-deleted version available

elif comment_status == "dm":
comment_payload["payload"].update(
{
"content": "comment was deleted by the moderator.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in RDM we do not have the "moderator" - it would be good to align it with what we display when we delete a comment in RDM (I don't remember the exact text). ping @zzacharo for more opinions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We not display this content for the deleted comments, we do this in the frontend

{}, request=request.model, request_id=str(request.id), type=event_type
)

if data.get("file_relation"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a small comment on why we are doing this?

)
return self.all_record_versions[str(oldest_version)]

def create_event(self, request, data, community, record, parent_comment_id=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any way we can optimise this function to be more readable? there are a lot of conditional statements, some with repeated conditions, also it would be good if we avoid nesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more optimisations

{"user": str(user.id)}, raise_=True
)
else:
print("User not found for email: ", data.get("created_by"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the print is redundant if you raise.
what will happen if you raise? will the whole script halt? and need to be re-run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it gets caught and logged in _load() and now that I have put it under the UnitOfWork context as you suggested, it will rollback when this is raised.

event.model.version_id = 0

event.commit()
db.session.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better if we do the uow instead? otherwise you will need to re-index all requests
plus, from records migration experience I can tell you uow is faster

created_at = datetime.fromisoformat(record["created"])
request.model.created = created_at

request.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part would also benefic from uow

environment, collection
)
"""
collection_name/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this docstring very helpful, thank you!



# Function to flatten arbitrarily nested comment replies into a 1-level replies list
def flatten_replies(comments_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do the rubber duck excersise on this one :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comments migration

2 participants