Ensure SentencePiece tokenizer returns string pieces #601

Merged
RNA4219 merged 1 commit into main from codex/fix-sentencepieceprocessor-encoding-issue
Nov 2, 2025
Conversation

Owner

RNA4219 commented Nov 2, 2025

Summary

  • tighten the SentencePiece processor test double to require out_type=str and capture encode_as_pieces usage
  • add regression coverage for both direct encode and encode_as_pieces fallback paths
  • update the CLI SentencePiece tokenizer to request string pieces via encode(..., out_type=str) or fall back to encode_as_pieces
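The behavior described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `tokenize_to_pieces`, `StrictProcessor`, and `LegacyProcessor` are hypothetical names, and the fallback trigger (a `TypeError` from an `encode` that does not accept `out_type`, or a non-string result) is an assumption, since the summary does not say when the fallback fires. Only `encode(..., out_type=str)` and `encode_as_pieces` come from the description.

```python
from typing import Any, List


def tokenize_to_pieces(processor: Any, text: str) -> List[str]:
    """Return string pieces, preferring encode(..., out_type=str)."""
    encode = getattr(processor, "encode", None)
    if callable(encode):
        try:
            pieces = encode(text, out_type=str)
        except TypeError:
            # Assumption: this encode() does not accept out_type.
            pieces = None
        if pieces is not None and all(isinstance(p, str) for p in pieces):
            return list(pieces)
    # Fallback path: ask for pieces explicitly.
    return list(processor.encode_as_pieces(text))


class StrictProcessor:
    """Hypothetical test double that insists on out_type=str."""

    def __init__(self) -> None:
        self.calls: list = []

    def encode(self, text: str, out_type: type = int) -> List[str]:
        self.calls.append(("encode", out_type))
        if out_type is not str:
            raise AssertionError("expected out_type=str")
        return text.split()

    def encode_as_pieces(self, text: str) -> List[str]:
        self.calls.append(("encode_as_pieces", None))
        return text.split()


class LegacyProcessor:
    """Hypothetical double whose encode() has no out_type parameter."""

    def __init__(self) -> None:
        self.calls: list = []

    def encode(self, text: str) -> List[int]:
        self.calls.append("encode")
        return [1, 2]

    def encode_as_pieces(self, text: str) -> List[str]:
        self.calls.append("encode_as_pieces")
        return text.split()
```

The two doubles mirror the two regression paths named in the summary: `StrictProcessor` fails loudly if the caller forgets `out_type=str`, while `LegacyProcessor` forces the `encode_as_pieces` fallback and records that it was used.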

Testing

  • pytest tests/quality/evaluator/test_cli.py

https://chatgpt.com/codex/tasks/task_e_6907d6022a7c8321a567341f58b3606c

RNA4219 merged commit dd685f2 into main on Nov 2, 2025
15 checks passed
RNA4219 deleted the codex/fix-sentencepieceprocessor-encoding-issue branch on November 2, 2025 at 23:02