fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39
fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39Pawel-N-pl wants to merge 2 commits into
Conversation
… round-trips The lambda repls (#18) stopped build-state-doc crashing on non-ASCII, but the values are still emitted with json.dumps' default ensure_ascii=True, i.e. as "\uXXXX" escapes. The frontmatter reader (unquote_scalar) only strips the surrounding quotes — it does not JSON-decode — so those escapes survive a write->parse cycle as the literal text "\uXXXX". On a non-English project (epicName / customInstructions / aiCommand with an em-dash or Polish letters) every such scalar reads back corrupted, e.g. aiCommand "claude — go" parses as 'claude — go'. Fix: route value rendering through a small _yaml_value() helper that dumps with ensure_ascii=False, so the document carries real UTF-8 that quote-stripping recovers intact. Also convert the remaining agentConfig block re.sub from a raw-string repl to a callable one — non-ASCII in an agent/model id is unlikely, but a backslash (a "\g"- or "\b"-looking token) is realistic and would raise re.error: bad escape or corrupt into a backspace, the same hazard the scalar sites already guard against. Tests: extend test_replacement_unicode with frontmatter round-trip assertions (the existing tests assertIn on the whole document, which passes via the body {{token}} even when the frontmatter line is escaped) and an agentConfig case covering non-ASCII + backslash content.
|
Warning Review limit reached
More reviews will be available in 50 minutes and 30 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughThis PR adds Unicode-safe YAML serialization to state document generation. A new ChangesUnicode-safe YAML Serialization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🤖 Augment PR SummarySummary: This PR fixes Unicode corruption in state-document frontmatter by ensuring scalar values are written as real UTF-8 rather than JSON-style Changes:
Technical Notes: This aligns serialization with the current frontmatter reader behavior ( 🤖 Was this summary useful? React with 👍 or 👎 |
| # with unquote_scalar (frontmatter.py), which strips the surrounding quotes | ||
| # WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the | ||
| # literal text "\uXXXX" and corrupt the value. | ||
| return json.dumps(value, ensure_ascii=False) |
There was a problem hiding this comment.
_yaml_value() now emits raw non-ASCII characters, but cmd_build_state_doc writes via Path.write_text() without an explicit encoding, which can raise UnicodeEncodeError or corrupt output on non-UTF-8 default locales (notably some Windows setups). This makes the new behavior potentially non-portable unless the state doc is always written as UTF-8.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Good catch — fixed in 605554f. write_text was safe before this PR because the output was pure ASCII (\uXXXX); emitting raw UTF-8 made the bare call locale-dependent. Now write_text(text, encoding="utf-8"), matching the repo's read_text helper. Verified the unicode tests stay green under LC_ALL=C, where the unencoded write would have raised UnicodeEncodeError.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_replacement_unicode.py (1)
130-146: ⚡ Quick winAssert the backslash-heavy
primaryvalue survives verbatim.This regression test currently proves the block splice does not throw, but it would still pass if
agent\g<0>were omitted or rewritten during substitution. Please assert the emittedprimarystring too, since that is the contract this case is trying to pin down.Suggested assertion
text = self._build_state(config).read_text(encoding="utf-8") + self.assertIn(r"agent\g<0>", text) self.assertIn("opus — żółw", text) self.assertNotIn(r"\u", text)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_replacement_unicode.py` around lines 130 - 146, The test test_agent_config_special_chars_do_not_break_resub should also assert that the backslash-heavy primary value is emitted verbatim; after building text = self._build_state(config).read_text(...), add an assertion that the exact string r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along with the existing checks for "opus — żółw" and no "\u" escapes so the test verifies the primary field survived the re.sub unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/state.py`:
- Around line 14-19: The state file is written with output_path.write_text(text)
which relies on locale and can mis-encode the non-ASCII bytes emitted by
_yaml_value(); modify the cmd_build_state_doc implementation to write the file
using explicit UTF-8 (for example encode the text with UTF-8 and write bytes or
call write_text with encoding="utf-8") so that the JSON/YAML strings from
_yaml_value() round-trip correctly; update the code path in cmd_build_state_doc
that constructs and persists text to use explicit UTF-8 output.
---
Nitpick comments:
In `@tests/test_replacement_unicode.py`:
- Around line 130-146: The test
test_agent_config_special_chars_do_not_break_resub should also assert that the
backslash-heavy primary value is emitted verbatim; after building text =
self._build_state(config).read_text(...), add an assertion that the exact string
r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along
with the existing checks for "opus — żółw" and no "\u" escapes so the test
verifies the primary field survived the re.sub unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f34e31ef-2a5e-4c92-9fc1-4b0d2ef1d782
📒 Files selected for processing (2)
skills/bmad-story-automator/src/story_automator/commands/state.pytests/test_replacement_unicode.py
…ckslash test
Addresses review feedback on the non-ASCII serialization change:
- output_path.write_text(text) had no explicit encoding. Now that the document
carries raw UTF-8 (not \uXXXX escapes), the bare write_text would use the
platform default locale and could raise UnicodeEncodeError / mojibake on a
non-UTF-8 locale (Windows cp1252, POSIX C). Pass encoding="utf-8", matching
the repo's read_text helper. (Augment medium / CodeRabbit major.)
- Strengthen the agentConfig regression test: assert the backslash value is
emitted verbatim ("agent\\g<0>", JSON-doubled) so the test pins the value
survives the splice, not merely that re.sub didn't raise. (CodeRabbit nitpick.)
|
@coderabbitai re: the agentConfig test nitpick — good point that the test should pin the value, not just that re.sub didn't throw. Addressed in 605554f, but note the suggested |
|
Tip For best results, initiate chat on the files or code changes.
🐇 ✨ |
bma-d
left a comment
There was a problem hiding this comment.
Thanks for the work here. I'm requesting changes on the scalar round-trip issue below because the new writer still emits JSON escapes that the existing frontmatter readers do not decode.
| # with unquote_scalar (frontmatter.py), which strips the surrounding quotes | ||
| # WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the | ||
| # literal text "\uXXXX" and corrupt the value. | ||
| return json.dumps(value, ensure_ascii=False) |
There was a problem hiding this comment.
This fixes raw non-ASCII output, but it still leaves the scalar writer and readers on different contracts. _yaml_value() emits JSON string escapes for backslashes, quotes, and control characters, while parse_simple_frontmatter() / unquote_scalar() only strip the surrounding quotes. Those values are persisted differently when the generated state is read back.
Repro:
PYTHONPATH=skills/bmad-story-automator/src python - <<'PY'
from story_automator.commands.state import _yaml_value
from story_automator.core.frontmatter import parse_simple_frontmatter
for value in [r"path\\to\\file", 'say "hi"', "line1\nline2", r"agent\\g<0>", "obsłuż"]:
doc = f"---\naiCommand: {_yaml_value(value)}\n---\n"
parsed = parse_simple_frontmatter(doc)["aiCommand"]
print({"input": value, "serialized": _yaml_value(value), "parsed": parsed, "equal": parsed == value})
PYThe backslash, quote, and newline cases print equal: False; only the non-ASCII case prints True. Please make scalar serialization/parsing a single contract, or update the readers to decode the emitted representation, and add parity coverage for these escaped scalar cases including agentConfig values.
fix(story-automator): address maintainer review on bmad-code-org#38/bmad-code-org#39/bmad-code-org#41/bmad-code-org#42
What
build-state-docwrites frontmatter scalar values (epicName,customInstructions,aiCommand,agentConfigids, …) as raw UTF-8 instead of\uXXXXescapes, so non-ASCII values survive a write→parse cycle intact. Also hardens the last remainingre.subsite (theagentConfigblock splice) againstre.error.Why
The lambda repls from #18 stopped the hard crash on non-ASCII, but the values are still encoded with
json.dumps' defaultensure_ascii=True, i.e. as"\uXXXX". The frontmatter readerunquote_scalar(core/frontmatter.py) only strips the surrounding quotes — it does not JSON-decode — so those escapes round-trip as the literal text\uXXXX.On a non-English project this corrupts every such scalar. Live repro on
main:(Lists like
storyRangeare unaffected — they're read viaparse_string_list_literal, which doesjson.loads.)Fix
_yaml_value()helper that dumps withensure_ascii=False, so the document carries real UTF-8 that quote-stripping recovers correctly.agentConfigblockre.subfrom a raw-string repl to a callable one. Non-ASCII in an agent/model id is unlikely, but a backslash (a\g- or\b-looking token) is realistic and would raisere.error: bad escapeor corrupt into a backspace — the same hazard the scalar sites already guard against.Tests
Extended
tests/test_replacement_unicode.py:test_frontmatter_scalars_round_trip_non_ascii— parses the frontmatter and asserts the scalar equals the original. The existing testsassertInon the whole document, which passes via the body{{token}}(rendered withstr.replace) even when the frontmatter line is escaped — so they didn't catch this.test_agent_config_special_chars_do_not_break_resub— non-ASCII model id + backslash content through theagentConfigsplice.env -u AI_AGENT npm run verifygreen (413 python tests). The lonetest_state_policy_metadatafailure without-u AI_AGENTis the known Claude-CodeAI_AGENTenv leak, not related to this change.Summary by CodeRabbit
Bug Fixes
Tests