Fix crash on file upload with non-ASCII filename#1352
Fix crash on file upload with non-ASCII filename#1352lawrence3699 wants to merge 1 commit intosissbruecker:masterfrom
Conversation
Restrict sanitize_char in _generate_asset_filename to ASCII-only characters. str.isalnum() returns True for accented letters like é, which then cause UnicodeEncodeError in gzip.open() on systems with ASCII filesystem encoding (common in Docker containers). Adding str.isascii() as a guard ensures only portable ASCII characters reach the filesystem while display_name still preserves the original filename for the UI. Fixes sissbruecker#1348
There was a problem hiding this comment.
Pull request overview
Fixes a crash during asset uploads when the uploaded filename contains non-ASCII characters by tightening filename sanitization and adding a regression test to validate the behavior.
Changes:
- Restrict
_generate_asset_filenameto only allow ASCII alphanumerics (plus-,_,.) in the generated on-disk filename. - Add a regression test for uploading
explodér.html, asserting the stored filename is sanitized whiledisplay_namepreserves the original name.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
bookmarks/services/assets.py |
Updates filename sanitization to prevent non-ASCII characters from being written into filesystem paths. |
bookmarks/tests/test_assets_service.py |
Adds a regression test covering upload behavior with a non-ASCII filename. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _generate_asset_filename( | ||
| asset: BookmarkAsset, filename: str, extension: str | ||
| ) -> str: | ||
| def sanitize_char(char): | ||
| if char.isalnum() or char in ("-", "_", "."): | ||
| if char.isascii() and (char.isalnum() or char in ("-", "_", ".")): | ||
| return char | ||
| else: | ||
| return "_" |
There was a problem hiding this comment.
_generate_asset_filename now sanitizes non-ASCII characters in the base filename, but the extension argument is still interpolated verbatim into the returned path. For uploaded files, extension comes from os.path.splitext(upload_file.name) and can itself contain non-ASCII characters (e.g. foo.é -> extension é / é.gz), which would still reintroduce the same UnicodeEncodeError on ASCII-encoded filesystems. Consider applying the same ASCII-only sanitization to extension (or normalizing/sanitizing the final composed filename) so all path segments are filesystem-safe.
| def test_upload_asset_with_non_ascii_filename(self): | ||
| bookmark = self.setup_bookmark() | ||
| file_content = b"test content" | ||
| upload_file = SimpleUploadedFile( | ||
| "explodér.html", file_content, content_type="text/html" | ||
| ) | ||
|
|
||
| asset = assets.upload_asset(bookmark, upload_file) |
There was a problem hiding this comment.
Other upload_asset tests are decorated with @disable_logging, but this new test is not. Since upload_asset logs an info message on success, this can add noise to test output and is inconsistent with the existing test suite conventions. Add @disable_logging above this test as well.
Fixes #1348
Problem
Uploading a file with non-ASCII characters in its name (e.g.
explodér.html) crashes withUnicodeEncodeErrorwhengzip.open()tries to create the compressed file on systems where the filesystem encoding is ASCII — the default in many Docker containers.The root cause is in
_generate_asset_filename'ssanitize_char: it useschar.isalnum()which returnsTruefor accented letters likeé('é'.isalnum() → True), so they pass through into the filesystem path. When Python then tries to open that path on an ASCII-encoded filesystem, it fails:Fix
Add
char.isascii()as a guard so only ASCII alphanumeric characters (plus-,_,.) survive into the filename. Non-ASCII characters are replaced with_, same as spaces and other special characters already are. The original filename is still preserved indisplay_namefor the UI.Tests
Added a regression test that uploads a file named
explodér.htmland verifies:éwith_display_namestill shows the original filenameAll 24 tests in
test_assets_service.pypass.