fix(parse): recognize .jsonl as a text file so upload encoding normalization applies (#2745) #2794

Open
r266-tech wants to merge 1 commit into volcengine:main from r266-tech:fix/jsonl-text-extension-encoding

Conversation

@r266-tech

Contributor

Completes #2745 (Fixes #2744), which added .jsonl to the vectorization text-extension set in embedding_utils.py, but left the parallel upload-time encoding path treating .jsonl as non-text.

Problem

is_text_file() (openviking/parse/parsers/upload_utils.py) decides text-vs-binary by exact suffix membership across CODE_EXTENSIONS + DOCUMENTATION_EXTENSIONS + ADDITIONAL_TEXT_EXTENSIONS. constants.py has .json in ADDITIONAL_TEXT_EXTENSIONS but not .jsonl (the suffix of data.jsonl is .jsonl, not .json), so is_text_file("data.jsonl") returns False.
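
To illustrate the suffix mismatch described above, here is a minimal sketch. The set contents and the `is_text_file_sketch` helper are hypothetical stand-ins, not the real `constants.py` or `upload_utils.py`; only the exact-suffix membership check across the three sets comes from the description.

```python
from pathlib import Path

# Hypothetical stand-ins; the real sets in constants.py hold many more entries.
CODE_EXTENSIONS = {".py", ".ts"}
DOCUMENTATION_EXTENSIONS = {".md", ".rst"}
ADDITIONAL_TEXT_EXTENSIONS = {".json", ".yaml"}  # pre-fix state: no ".jsonl"


def is_text_file_sketch(path: str) -> bool:
    """Return True when the path's exact suffix is in any of the three sets."""
    suffix = Path(path).suffix  # "data.jsonl" -> ".jsonl", never ".json"
    all_text = CODE_EXTENSIONS | DOCUMENTATION_EXTENSIONS | ADDITIONAL_TEXT_EXTENSIONS
    return suffix in all_text


json_result = is_text_file_sketch("data.json")
jsonl_result = is_text_file_sketch("data.jsonl")
```

Because membership is tested on the whole suffix, having `.json` in the set does nothing for `.jsonl`.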

detect_and_convert_encoding() returns content unchanged when is_text_file() returns False, so a legacy-encoded .jsonl upload skips UTF-8 normalization (normalize_text_bytes), unlike .json, and is then vectorized as text. That is the exact class of mojibake that #2770 (fix(parse): normalize legacy text encodings) fixed for text files. .jsonl is a first-class artifact here (session logs, ovpack index_records.jsonl, eval records), so the gap is real.
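
The effect of that early return can be sketched as follows. `normalize_sketch` is a hypothetical simplification, not the real `detect_and_convert_encoding` or `normalize_text_bytes`; in particular, the single hard-coded fallback encoding is an assumption made only to keep the example short.

```python
def normalize_sketch(content: bytes, is_text: bool, legacy_encoding: str = "latin-1") -> bytes:
    """Hypothetical stand-in for the upload-time normalization step."""
    if not is_text:
        # Non-text files pass through untouched; this is the path .jsonl took.
        return content
    try:
        content.decode("utf-8")
        return content  # already valid UTF-8
    except UnicodeDecodeError:
        # Assumed fallback: reinterpret with one legacy encoding, re-encode as UTF-8.
        return content.decode(legacy_encoding).encode("utf-8")


record = '{"name": "café"}'
raw = record.encode("latin-1")  # a legacy-encoded JSONL line

skipped = normalize_sketch(raw, is_text=False)    # treated as binary
normalized = normalize_sketch(raw, is_text=True)  # treated as text

# Reading the untouched bytes as UTF-8 later yields replacement characters.
mojibake = skipped.decode("utf-8", errors="replace")
```

In this sketch the skipped bytes are not valid UTF-8, so any later text handling has to either fail or substitute characters, while the normalized bytes round-trip to the original record.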

Fix

Add .jsonl to ADDITIONAL_TEXT_EXTENSIONS (next to .json). One set suffices — is_text_file unions all three. This makes .jsonl consistent with .json on the upload-encoding path (UTF-8 normalization applies) and consistent with its own treatment in embedding_utils.py from #2745. Behavior for every other extension is unchanged.
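
A small sketch of why one entry is enough and why nothing else changes. The set contents are hypothetical stand-ins for `constants.py`; only the three-way union comes from the description.

```python
# Hypothetical stand-ins; the real sets in constants.py hold many more entries.
CODE_EXTENSIONS = frozenset({".py", ".ts"})
DOCUMENTATION_EXTENSIONS = frozenset({".md", ".rst"})
ADDITIONAL_BEFORE = frozenset({".json", ".yaml"})
ADDITIONAL_AFTER = ADDITIONAL_BEFORE | {".jsonl"}  # the one-line change


def text_extensions(additional: frozenset) -> frozenset:
    """Union of the three sets, mirroring how is_text_file is described."""
    return CODE_EXTENSIONS | DOCUMENTATION_EXTENSIONS | additional


# The only extension whose classification changes is ".jsonl".
newly_text = text_extensions(ADDITIONAL_AFTER) - text_extensions(ADDITIONAL_BEFORE)
no_longer_text = text_extensions(ADDITIONAL_BEFORE) - text_extensions(ADDITIONAL_AFTER)
```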

Tests

tests/test_upload_utils.py::test_additional_text_extensions now asserts is_text_file("data.jsonl") is True.

Refs #2745, #2744, #2770.

…ization applies

Completes volcengine#2745, which added .jsonl to the vectorization text-extension set in
embedding_utils.py but left the parallel upload-time encoding path treating
.jsonl as non-text. is_text_file() decides text-vs-binary by exact suffix
membership across CODE_EXTENSIONS + DOCUMENTATION_EXTENSIONS +
ADDITIONAL_TEXT_EXTENSIONS, which had .json but not .jsonl (the suffix of
data.jsonl is .jsonl, not .json). So detect_and_convert_encoding skipped UTF-8
normalization for a legacy-encoded .jsonl -- unlike .json -- which then got
vectorized as text, the exact mojibake class volcengine#2770 fixed.

Add .jsonl to ADDITIONAL_TEXT_EXTENSIONS (next to .json); is_text_file unions
all three sets so one entry suffices. Behavior for every other extension is
unchanged. Adds a test assert. Refs volcengine#2745, volcengine#2744, volcengine#2770.