fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791
fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791sfc-gh-moczko wants to merge 4 commits intomainfrom
Conversation
…ror on JP Windows) On non-UTF-8 Windows locales (e.g., Japanese CP932), `snow sql -f` fails with UnicodeDecodeError because file I/O operations use the platform default encoding instead of UTF-8. This adds a configuration- driven encoding system that lets users set their preferred encoding via SNOWFLAKE_CLI_ENCODING_FILE_IO env var or config.toml without changing behavior for existing users on UTF-8 systems. Closes #2759 .... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code) Co-Authored-By: Cortex Code <noreply@snowflake.com>
0e10769 to
97c6e56
Compare
tests/sql/test_statement_reader.py writes UTF-8 files then reads them without configuring an encoding. On Windows the platform default is charmap (cp1252), causing a UnicodeDecodeError. Use monkeypatch to set SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8 so the tests verify the intended configured behaviour on all platforms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sfc-gh-olorek
left a comment
There was a problem hiding this comment.
Thank you for the contibution! Left some comments
| prompt = SecurePath(file).read_text( | ||
| file_size_limit_mb=DEFAULT_SIZE_LIMIT_MB, | ||
| encoding=get_file_io_encoding(), | ||
| ) |
There was a problem hiding this comment.
Thanks for adding this! Just to be complete I noticed that nativeapp/artifacts.py and spcs/services/manager.py still uses the old version. Both would still fail on JP Windows. Could you also change it there?
There was a problem hiding this comment.
Done — added encoding=get_file_io_encoding() to SecurePath.open() in both nativeapp/artifacts.py and spcs/services/manager.py.
| def get_file_io_encoding() -> Optional[str]: | ||
| """Get configured file I/O encoding, or None for platform default. | ||
|
|
||
| Resolution order: | ||
| 1. SNOWFLAKE_CLI_ENCODING_FILE_IO environment variable | ||
| 2. config.toml [cli.encoding] file_io value | ||
| 3. None (use platform default) | ||
| """ | ||
| env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_FILE_IO") | ||
| if env_encoding: | ||
| return env_encoding | ||
|
|
||
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="file_io") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
In case of invalid encoding for exmpale: SNOWFLAKE_CLI_ENCODING_FILE_IO=not-a-codec the error will be generated late during file read. Could you add as a last step a verfication that the encoding is valid?:
try:
codecs.lookup(encoding)
except LookupError:
raise CliError(
f"Unknown encoding '{encoding}'. Check SNOWFLAKE_CLI_ENCODING_FILE_IO "
f"or [cli.encoding] file_io in config.toml."
)There was a problem hiding this comment.
Done — added _validate_encoding() which calls codecs.lookup() and raises CliError for unknown codec names. Validation fires immediately when the env var or config value is read, not deferred to file open time.
| def get_subprocess_encoding() -> Optional[str]: | ||
| """Get configured subprocess encoding, or None for platform default. | ||
|
|
||
| Resolution order: | ||
| 1. SNOWFLAKE_CLI_ENCODING_SUBPROCESS environment variable | ||
| 2. config.toml [cli.encoding] subprocess value | ||
| 3. None (use platform default) | ||
| """ | ||
| env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_SUBPROCESS") | ||
| if env_encoding: | ||
| return env_encoding | ||
|
|
||
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="subprocess") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
This method is never used in the non-test codebase. Can we move it to the test package if it is a helper method or delete it if it not needed at all?
There was a problem hiding this comment.
Done — detect_encoding_environment() is moved to the test file as a local helper. It is no longer in encoding.py.
| PLUGINS_SECTION_PATH = [CLI_SECTION, PLUGINS_SECTION] | ||
| PLUGIN_ENABLED_KEY = "enabled" | ||
| FEATURE_FLAGS_SECTION_PATH = [CLI_SECTION, "features"] | ||
| ENCODING_SECTION_PATH = [CLI_SECTION, "encoding"] |
There was a problem hiding this comment.
This variable is never used. Could we delete it or acutally use it? : see comment about: get_config_value
There was a problem hiding this comment.
Done — ENCODING_SECTION_PATH is now imported and used in encoding.py, removing the duplicate hardcoded path.
src/snowflake/cli/api/encoding.py
Outdated
| try: | ||
| from snowflake.cli.api.config import get_config_value | ||
|
|
||
| return get_config_value("cli", "encoding", key="file_io") |
There was a problem hiding this comment.
We can use ENCODING_SECTION_PATH for the hardcoded constants.
There was a problem hiding this comment.
Done — both get_file_io_encoding() and get_subprocess_encoding() now call get_config_value(*ENCODING_SECTION_PATH, key=...).
src/snowflake/cli/api/encoding.py
Outdated
| except Exception: | ||
| return None |
There was a problem hiding this comment.
except Exception with return None swallows all exceptions. What do you think about making the catch more granural (for example: KeyError, MissingConfiguration) and log the problem?
There was a problem hiding this comment.
Done — replaced except Exception with _MISSING_ENCODING_EXCEPTIONS = (KeyError, NonExistentKey, MissingConfigOptionError, ConfigSourceError), matching exactly the exceptions get_config_value raises when a key is absent. Added a log.debug() call on the not-configured path.
src/snowflake/cli/api/encoding.py
Outdated
| "locale": locale.getpreferredencoding(), | ||
| } | ||
|
|
||
| encodings = {v.lower().replace("-", "") for v in env_info.values()} |
There was a problem hiding this comment.
Nit: codec name can also contain underscore for exmple: utf_8. Could you use Python's codecs.lookup(name).name instead of replacing the -?
There was a problem hiding this comment.
Done — detect_encoding_environment() (now in the test file) uses codecs.lookup(v).name for canonical comparison instead of string replacement.
- Wire ENCODING_SECTION_PATH constant into get_file_io_encoding() and get_subprocess_encoding() instead of hard-coded string literals - Add _validate_encoding() helper: calls codecs.lookup() and raises CliError with a clear message on invalid codec names (e.g. bad env var) - Narrow exception handling from broad `except Exception` to explicit (KeyError, NonExistentKey, MissingConfigOptionError, ConfigSourceError) with a log.debug() call so nothing is silently swallowed - Remove detect_encoding_environment() from production code; move it to tests/api/test_encoding.py as a test-local helper. Fix codec name comparison to use codecs.lookup(v).name instead of fragile .replace() - Add encoding=get_file_io_encoding() to SecurePath.open() in nativeapp/artifacts.py and spcs/services/manager.py so JP Windows users are covered in all file-reading paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…id encoding
- Simplify _validate_encoding error message to "Check the value set in {source}"
avoiding the redundant 'or [cli.encoding]...' clause when source is already
a config path
- Add test_raises_on_invalid_encoding to TestGetFileIoEncoding and
TestGetSubprocessEncoding so the CliError raised on bad codec names is covered
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Changes
Test plan
Closes #2759
.... Generated with Cortex Code