Skip to content

fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791

Open
sfc-gh-moczko wants to merge 4 commits intomainfrom
feature/encoding-fix
Open

fix: add configurable file encoding to resolve #2759 (UnicodeDecodeError on JP Windows)#2791
sfc-gh-moczko wants to merge 4 commits intomainfrom
feature/encoding-fix

Conversation

@sfc-gh-moczko
Copy link
Copy Markdown
Collaborator

Summary

  • Adds a configuration-driven encoding system (SNOWFLAKE_CLI_ENCODING_FILE_IO env var or [cli.encoding] file_io in config.toml) to allow users on non-UTF-8 locales (e.g., Japanese Windows with CP932) to specify the file I/O encoding
  • Passes configured encoding through SecurePath.read_text() and SecurePath.open() at all file-reading callsites: SQL statement reader, Jinja rendering, Cortex commands, Snowpark requirements, and entity script templates
  • Returns None (platform default) when unconfigured, preserving existing behavior for all current users

Changes

  • New: src/snowflake/cli/api/encoding.py -- get_file_io_encoding(), get_subprocess_encoding(), detect_encoding_environment()
  • Modified: statement_reader.py -- from_file() and files_reader() now pass encoding (root cause of SNOW-3066557: UnicodeDecodeError when executing SQL files with UTF-8 encoding on Japanese Windows #2759)
  • Modified: rendering/jinja.py -- read_file_content() and procedure_from_js_file()
  • Modified: entities/utils.py -- render_script_template()
  • Modified: cortex/commands.py and cortex/manager.py -- 5 read_text() callsites
  • Modified: snowpark/commands.py -- _read_snowflake_requirements_file()
  • Modified: api/config.py -- added ENCODING_SECTION_PATH constant

Test plan

  • 12 unit tests for encoding.py (env var, config, precedence, detection, mismatch warning)
  • 2 new tests in test_statement_reader.py verifying UTF-8 Japanese SQL files work through both files_reader() and ParsedStatement.from_file()
  • 2 new tests in test_secure_path.py verifying read_text(encoding=) and open(encoding=) pass-through
  • Bug reproduction confirmed: CP932 locale + UTF-8 Japanese SQL -> UnicodeDecodeError without fix, succeeds with SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8
  • All pre-commit hooks pass (ruff, black, mypy, check-ast, codespell, license headers)

Closes #2759

.... Generated with Cortex Code

@sfc-gh-moczko sfc-gh-moczko requested a review from a team as a code owner February 27, 2026 16:45
…ror on JP Windows)

On non-UTF-8 Windows locales (e.g., Japanese CP932), `snow sql -f`
fails with UnicodeDecodeError because file I/O operations use the
platform default encoding instead of UTF-8. This adds a configuration-
driven encoding system that lets users set their preferred encoding
via SNOWFLAKE_CLI_ENCODING_FILE_IO env var or config.toml without
changing behavior for existing users on UTF-8 systems.

Closes #2759

.... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code)

Co-Authored-By: Cortex Code <noreply@snowflake.com>
tests/sql/test_statement_reader.py writes UTF-8 files then reads them
without configuring an encoding. On Windows the platform default is
charmap (cp1252), causing a UnicodeDecodeError. Use monkeypatch to set
SNOWFLAKE_CLI_ENCODING_FILE_IO=utf-8 so the tests verify the intended
configured behaviour on all platforms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@sfc-gh-olorek sfc-gh-olorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contibution! Left some comments

Comment on lines +172 to +175
prompt = SecurePath(file).read_text(
file_size_limit_mb=DEFAULT_SIZE_LIMIT_MB,
encoding=get_file_io_encoding(),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Just to be complete I noticed that nativeapp/artifacts.py and spcs/services/manager.py still uses the old version. Both would still fail on JP Windows. Could you also change it there?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added encoding=get_file_io_encoding() to SecurePath.open() in both nativeapp/artifacts.py and spcs/services/manager.py.

Comment on lines +26 to +43
def get_file_io_encoding() -> Optional[str]:
"""Get configured file I/O encoding, or None for platform default.

Resolution order:
1. SNOWFLAKE_CLI_ENCODING_FILE_IO environment variable
2. config.toml [cli.encoding] file_io value
3. None (use platform default)
"""
env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_FILE_IO")
if env_encoding:
return env_encoding

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="file_io")
except Exception:
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of invalid encoding for exmpale: SNOWFLAKE_CLI_ENCODING_FILE_IO=not-a-codec the error will be generated late during file read. Could you add as a last step a verfication that the encoding is valid?:

try:
  codecs.lookup(encoding)
except LookupError:
   raise CliError(
                f"Unknown encoding '{encoding}'. Check SNOWFLAKE_CLI_ENCODING_FILE_IO "
                f"or [cli.encoding] file_io in config.toml."
            )

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added _validate_encoding() which calls codecs.lookup() and raises CliError for unknown codec names. Validation fires immediately when the env var or config value is read, not deferred to file open time.

Comment on lines +46 to +63
def get_subprocess_encoding() -> Optional[str]:
"""Get configured subprocess encoding, or None for platform default.

Resolution order:
1. SNOWFLAKE_CLI_ENCODING_SUBPROCESS environment variable
2. config.toml [cli.encoding] subprocess value
3. None (use platform default)
"""
env_encoding = os.environ.get("SNOWFLAKE_CLI_ENCODING_SUBPROCESS")
if env_encoding:
return env_encoding

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="subprocess")
except Exception:
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is never used in the non-test codebase. Can we move it to the test package if it is a helper method or delete it if it not needed at all?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — detect_encoding_environment() is moved to the test file as a local helper. It is no longer in encoding.py.

PLUGINS_SECTION_PATH = [CLI_SECTION, PLUGINS_SECTION]
PLUGIN_ENABLED_KEY = "enabled"
FEATURE_FLAGS_SECTION_PATH = [CLI_SECTION, "features"]
ENCODING_SECTION_PATH = [CLI_SECTION, "encoding"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is never used. Could we delete it or acutally use it? : see comment about: get_config_value

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — ENCODING_SECTION_PATH is now imported and used in encoding.py, removing the duplicate hardcoded path.

try:
from snowflake.cli.api.config import get_config_value

return get_config_value("cli", "encoding", key="file_io")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use ENCODING_SECTION_PATH for the hardcoded constants.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — both get_file_io_encoding() and get_subprocess_encoding() now call get_config_value(*ENCODING_SECTION_PATH, key=...).

Comment on lines +42 to +43
except Exception:
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except Exception with return None swallows all exceptions. What do you think about making the catch more granural (for example: KeyError, MissingConfiguration) and log the problem?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — replaced except Exception with _MISSING_ENCODING_EXCEPTIONS = (KeyError, NonExistentKey, MissingConfigOptionError, ConfigSourceError), matching exactly the exceptions get_config_value raises when a key is absent. Added a log.debug() call on the not-configured path.

"locale": locale.getpreferredencoding(),
}

encodings = {v.lower().replace("-", "") for v in env_info.values()}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: codec name can also contain underscore for exmple: utf_8. Could you use Python's codecs.lookup(name).name instead of replacing the -?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — detect_encoding_environment() (now in the test file) uses codecs.lookup(v).name for canonical comparison instead of string replacement.

sfc-gh-moczko and others added 2 commits March 23, 2026 17:14
- Wire ENCODING_SECTION_PATH constant into get_file_io_encoding() and
  get_subprocess_encoding() instead of hard-coded string literals
- Add _validate_encoding() helper: calls codecs.lookup() and raises
  CliError with a clear message on invalid codec names (e.g. bad env var)
- Narrow exception handling from broad `except Exception` to explicit
  (KeyError, NonExistentKey, MissingConfigOptionError, ConfigSourceError)
  with a log.debug() call so nothing is silently swallowed
- Remove detect_encoding_environment() from production code; move it
  to tests/api/test_encoding.py as a test-local helper. Fix codec name
  comparison to use codecs.lookup(v).name instead of fragile .replace()
- Add encoding=get_file_io_encoding() to SecurePath.open() in
  nativeapp/artifacts.py and spcs/services/manager.py so JP Windows
  users are covered in all file-reading paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…id encoding

- Simplify _validate_encoding error message to "Check the value set in {source}"
  avoiding the redundant 'or [cli.encoding]...' clause when source is already
  a config path
- Add test_raises_on_invalid_encoding to TestGetFileIoEncoding and
  TestGetSubprocessEncoding so the CliError raised on bad codec names is covered

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SNOW-3066557: UnicodeDecodeError when executing SQL files with UTF-8 encoding on Japanese Windows

2 participants