fix(parse): recognize .jsonl as a text file so upload encoding normalization applies (#2745) #2794
Open
r266-tech wants to merge 1 commit into
…ization applies Completes volcengine#2745, which added .jsonl to the vectorization text-extension set in embedding_utils.py but left the parallel upload-time encoding path treating .jsonl as non-text. is_text_file() decides text-vs-binary by exact suffix membership across CODE_EXTENSIONS + DOCUMENTATION_EXTENSIONS + ADDITIONAL_TEXT_EXTENSIONS, which had .json but not .jsonl (the suffix of data.jsonl is .jsonl, not .json). So detect_and_convert_encoding skipped UTF-8 normalization for a legacy-encoded .jsonl -- unlike .json -- which then got vectorized as text, the exact mojibake class volcengine#2770 fixed. Add .jsonl to ADDITIONAL_TEXT_EXTENSIONS (next to .json); is_text_file unions all three sets so one entry suffices. Behavior for every other extension is unchanged. Adds a test assert. Refs volcengine#2745, volcengine#2744, volcengine#2770.
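The suffix mismatch described in the commit message can be illustrated with a minimal sketch. The three sets below are hypothetical stand-ins (the real ones in `constants.py` hold many more entries), and `is_text_file_sketch` is an assumed simplification of `is_text_file()`, not the project's actual code.

```python
from pathlib import Path

# Hypothetical stand-ins for the three extension sets named above.
CODE_EXTENSIONS = {".py", ".js"}
DOCUMENTATION_EXTENSIONS = {".md", ".txt"}
ADDITIONAL_TEXT_EXTENSIONS = {".json", ".yaml"}


def is_text_file_sketch(name, extra=frozenset()):
    """Exact-suffix membership over the union of the sets (assumed logic)."""
    text_exts = (
        CODE_EXTENSIONS
        | DOCUMENTATION_EXTENSIONS
        | ADDITIONAL_TEXT_EXTENSIONS
        | set(extra)
    )
    return Path(name).suffix in text_exts


print(Path("data.jsonl").suffix)                      # .jsonl
print(is_text_file_sketch("data.json"))               # True
print(is_text_file_sketch("data.jsonl"))              # False: ".jsonl" != ".json"
print(is_text_file_sketch("data.jsonl", {".jsonl"}))  # True once the suffix is listed
```

Because membership is tested against the union, listing the suffix in any one of the sets is enough to flip the result.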
Completes #2745 (Fixes #2744), which added `.jsonl` to the vectorization text-extension set in `embedding_utils.py` but left the parallel upload-time encoding path treating `.jsonl` as non-text.

## Problem
`is_text_file()` (`openviking/parse/parsers/upload_utils.py`) decides text-vs-binary by exact suffix membership across `CODE_EXTENSIONS` + `DOCUMENTATION_EXTENSIONS` + `ADDITIONAL_TEXT_EXTENSIONS`. `constants.py` has `.json` in `ADDITIONAL_TEXT_EXTENSIONS` but not `.jsonl` (the suffix of `data.jsonl` is `.jsonl`, not `.json`), so `is_text_file("data.jsonl")` returns `False`.

`detect_and_convert_encoding()` returns content unchanged when `is_text_file` is `False`, so a legacy-encoded `.jsonl` upload skips UTF-8 normalization (`normalize_text_bytes`), unlike `.json`, and then gets vectorized as text. That's the exact mojibake class #2770 (fix(parse): normalize legacy text encodings) fixed for text files.

`.jsonl` is a first-class artifact here (session logs, `ovpack` `index_records.jsonl`, eval records), so the gap is real.

## Fix
Add `.jsonl` to `ADDITIONAL_TEXT_EXTENSIONS` (next to `.json`). One set suffices: `is_text_file` unions all three. This makes `.jsonl` consistent with `.json` on the upload-encoding path (UTF-8 normalization applies) and consistent with its own treatment in `embedding_utils.py` from #2745. Behavior for every other extension is unchanged.

## Tests
`tests/test_upload_utils.py::test_additional_text_extensions` now asserts `is_text_file("data.jsonl") is True`.

Refs #2745, #2744, #2770.
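For reviewers who want to see why the skipped normalization matters, here is a rough sketch of the failure mode. It assumes a GBK-encoded upload and uses hypothetical helpers (`to_utf8`, `convert_if_text`) with a fixed source encoding. The real `detect_and_convert_encoding()` and `normalize_text_bytes` detect the encoding themselves and may have different signatures.

```python
import json


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode legacy bytes as UTF-8 (stand-in for real detection/normalization)."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_if_text(raw: bytes, is_text: bool, source_encoding: str) -> bytes:
    """Hypothetical gate: content judged non-text is returned unchanged."""
    return to_utf8(raw, source_encoding) if is_text else raw


# One JSONL record written in a legacy encoding (GBK is assumed here).
record = {"msg": "你好"}
raw = (json.dumps(record, ensure_ascii=False) + "\n").encode("gbk")

# Before the fix: .jsonl is treated as non-text, so the bytes pass through.
skipped = convert_if_text(raw, is_text=False, source_encoding="gbk")
# After the fix: .jsonl is treated as text, so the bytes are normalized.
normalized = convert_if_text(raw, is_text=True, source_encoding="gbk")

# Downstream code that assumes UTF-8 sees replacement characters in the
# first case and the original record in the second.
garbled = skipped.decode("utf-8", errors="replace")
print(repr(garbled))
print(json.loads(normalized.decode("utf-8")))  # {'msg': '你好'}
```

The only thing that differs between the two calls is the text/non-text decision, which is what the one-line constant change flips for `.jsonl`.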