Harden fm-shim URI unicode parsing#5
Conversation
|
Summary Wired get_path_list_from_uris() to use the new byte-level decoder, keeping the existing local-file/absolute-path checks while ensuring only safe ASCII paths proceed to resolution/opening. This closes the “trick the script with percent-encoded Unicode/bidi/confusable characters” class of inputs. Added regression tests that stub PyQt imports and verify: valid printable-ASCII directory URIs still work, while percent-encoded Unicode and invalid byte sequences are rejected. Committed the changes on the current branch as 7a24ba84 with commit message Harden fm-shim URI unicode parsing. Testing ✅ python3 -m unittest -v tests/test_fm_shim_frontend.py ✅ python3 -m py_compile 'usr/lib/python3/dist-packages/fm_shim_frontend/fm_shim_frontend.py#security-misc-shared' tests/test_fm_shim_frontend.py What the issue was A plausible “trick” input was a file:// URI containing percent-encoded bidi/confusable Unicode, such as %E2%80%AE, or malformed bytes such as %FF. Those are now rejected before path resolution. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd16dcf75c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| module_path = Path( | ||
| "/workspace/security-misc/usr/lib/python3/dist-packages/" | ||
| "fm_shim_frontend/fm_shim_frontend.py#security-misc-shared" |
There was a problem hiding this comment.
Load the frontend test module relative to the checkout
When this test runs from any checkout path other than /workspace/security-misc—for example, a CI job using a temporary workspace—_load_module() will fail with FileNotFoundError before any assertions execute, because the module path is hardcoded to this machine-specific source-tree location. Please derive the path from __file__ or the repo root so the new test coverage is portable.
Useful? React with 👍 / 👎.
|
I don't understand this change. Each character in a Python string is a Unicode character, so any Unicode characters that fall outside of the printable 7-bit ASCII range should fail the check. The fact that Unicode bytes manage to become a Python string doesn't concern me. This change would also make it much harder to start accepting a safe subset of Unicode if we choose to do so later. At the risk of this looking like "two chatbots talking to each other", I asked ChatGPT to double-check my logic, and it apparently thinks I'm right (which doesn't mean I am right, but at least I'm probably not completely off track). Here's the very brief conversation: Me: I just finished reviewing some code submitted as a PR to a Python project I help maintain. Originally, the code in question used The code contributor is claiming that this is unreliable because the URL ends up getting converted to a Python string before it is checked for Unicode, and is suggesting I switch to an approach that decodes the string into bytes using This makes little sense to me, because Python should make sure any Python strings consist only of valid Unicode. (Enabled "Thinking" mode to use GPT-5.4 Thinking Mini) ChatGPT: You are not missing a bypass. For the policy you described — “after URI decoding, accept only printable 7-bit ASCII” — your current approach is logically sound:
In other words, the fact that Python strings are Unicode is not a problem here, because your validation is explicitly on Unicode code points, and it rejects anything outside Where The only real caveat is one of intent, not security: if someone later changes the code to do more permissive text handling after I believe this PR can be rejected. |
Motivation
str.%E2%80%AE) or invalid bytes (e.g.%FF) could be used to "trick" the script.Description
decode_ascii_path_from_uri_path()that usesurllib.parse.unquote_to_bytes()and only accepts printable ASCII bytes (defined via_PRINTABLE_ASCII_BYTESand_ALLOWED_PATH_BYTES).urllib.parse.unquote()-based path decoding inget_path_list_from_uris()with the new decoder and reject inputs earlier if the decoder returnsNone.Path(...).resolve()andgiolaunch.tests/test_fm_shim_frontend.pyto stub PyQt imports and verify acceptance of printable-ASCII URIs and rejection of percent-encoded Unicode and invalid byte sequences.usr/lib/python3/dist-packages/fm_shim_frontend/fm_shim_frontend.py; added file:tests/test_fm_shim_frontend.py.Testing
python3 -m unittest -v tests/test_fm_shim_frontend.py(4 tests) and all tests passed.python3 -m py_compile 'usr/lib/python3/dist-packages/fm_shim_frontend/fm_shim_frontend.py#security-misc-shared' tests/test_fm_shim_frontend.pyand compilation succeeded.Codex Task