Skip to content

fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39

Open
Pawel-N-pl wants to merge 2 commits into
mainfrom
fix/state-frontmatter-unicode-roundtrip
Open

fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39
Pawel-N-pl wants to merge 2 commits into
mainfrom
fix/state-frontmatter-unicode-roundtrip

Conversation

@Pawel-N-pl

@Pawel-N-pl Pawel-N-pl commented Jun 7, 2026

Copy link
Copy Markdown

What

build-state-doc writes frontmatter scalar values (epicName, customInstructions, aiCommand, agentConfig ids, …) as raw UTF-8 instead of \uXXXX escapes, so non-ASCII values survive a write→parse cycle intact. Also hardens the last remaining re.sub site (the agentConfig block splice) against re.error.

Why

The lambda repls from #18 stopped the hard crash on non-ASCII, but the values are still encoded with json.dumps' default ensure_ascii=True, i.e. as "\uXXXX". The frontmatter reader unquote_scalar (core/frontmatter.py) only strips the surrounding quotes — it does not JSON-decode — so those escapes round-trip as the literal text \uXXXX.

On a non-English project this corrupts every such scalar. Live repro on main:

FRONT> aiCommand: "claude — go"
PARSED aiCommand => 'claude \\u2014 go'        # ← literal —, not "—"
PARSED epicName  => 'upmon \\u2014 Epic 3 obs\\u0142u\\u017c'

(Lists like storyRange are unaffected — they're read via parse_string_list_literal, which does json.loads.)

Fix

  • Route value rendering through a small _yaml_value() helper that dumps with ensure_ascii=False, so the document carries real UTF-8 that quote-stripping recovers correctly.
  • Convert the agentConfig block re.sub from a raw-string repl to a callable one. Non-ASCII in an agent/model id is unlikely, but a backslash (a \g- or \b-looking token) is realistic and would raise re.error: bad escape or corrupt into a backspace — the same hazard the scalar sites already guard against.

Tests

Extended tests/test_replacement_unicode.py:

  • test_frontmatter_scalars_round_trip_non_ascii — parses the frontmatter and asserts the scalar equals the original. The existing tests assertIn on the whole document, which passes via the body {{token}} (rendered with str.replace) even when the frontmatter line is escaped — so they didn't catch this.
  • test_agent_config_special_chars_do_not_break_resub — non-ASCII model id + backslash content through the agentConfig splice.

env -u AI_AGENT npm run verify green (413 python tests). The lone test_state_policy_metadata failure without -u AI_AGENT is the known Claude-Code AI_AGENT env leak, not related to this change.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed state generation to properly preserve non-ASCII characters (Unicode) in frontmatter and configuration instead of converting them to escape sequences, ensuring improved readability for international text.
  • Tests

    • Added comprehensive test coverage for Unicode round-trip serialization and special character handling in agent configuration.

… round-trips

The lambda repls (#18) stopped build-state-doc crashing on non-ASCII, but the
values are still emitted with json.dumps' default ensure_ascii=True, i.e. as
"\uXXXX" escapes. The frontmatter reader (unquote_scalar) only strips the
surrounding quotes — it does not JSON-decode — so those escapes survive a
write->parse cycle as the literal text "\uXXXX". On a non-English project
(epicName / customInstructions / aiCommand with an em-dash or Polish letters)
every such scalar reads back corrupted, e.g. aiCommand "claude — go" parses as
'claude — go'.

Fix: route value rendering through a small _yaml_value() helper that dumps with
ensure_ascii=False, so the document carries real UTF-8 that quote-stripping
recovers intact. Also convert the remaining agentConfig block re.sub from a
raw-string repl to a callable one — non-ASCII in an agent/model id is unlikely,
but a backslash (a "\g"- or "\b"-looking token) is realistic and would raise
re.error: bad escape or corrupt into a backspace, the same hazard the scalar
sites already guard against.

Tests: extend test_replacement_unicode with frontmatter round-trip assertions
(the existing tests assertIn on the whole document, which passes via the body
{{token}} even when the frontmatter line is escaped) and an agentConfig case
covering non-ASCII + backslash content.
@Pawel-N-pl Pawel-N-pl requested a review from bma-d as a code owner June 7, 2026 08:02
@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@Pawel-N-pl, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 50 minutes and 30 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1af4eda4-cb5f-44ae-923d-a11894c3d112

📥 Commits

Reviewing files that changed from the base of the PR and between 66926de and 605554f.

📒 Files selected for processing (2)
  • skills/bmad-story-automator/src/story_automator/commands/state.py
  • tests/test_replacement_unicode.py

Walkthrough

This PR adds Unicode-safe YAML serialization to state document generation. A new _yaml_value helper prevents ASCII-only escaping, and this helper replaces json.dumps calls throughout config field serialization in cmd_build_state_doc. Tests verify non-ASCII values and backslash patterns survive round-trip parsing without corruption.

Changes

Unicode-safe YAML Serialization

Layer / File(s) Summary
YAML serialization helper
skills/bmad-story-automator/src/story_automator/commands/state.py
Introduces _yaml_value(value: Any) -> str using json.dumps(..., ensure_ascii=False) to serialize values without \uXXXX escape sequences, preserving readability for non-ASCII and backslash-containing strings in YAML-formatted text.
State document generation updates
skills/bmad-story-automator/src/story_automator/commands/state.py
Replaces json.dumps with _yaml_value across five callsites in cmd_build_state_doc: customInstructions, agentConfig default fields (defaultPrimary, defaultFallback, defaultModel), perTask overrides, complexityOverrides entries, and the final replacement loop.
Unicode and special character test coverage
tests/test_replacement_unicode.py
Adds import for parse_simple_frontmatter and two new regression test cases: one validating non-ASCII frontmatter scalars round-trip through parsing without escaping, another verifying agentConfig content with backslash patterns (e.g., \g<0>) remains intact without regex errors or literal escape sequences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • bmad-code-org/bmad-automator#7: Both PRs address non-ASCII and backslash handling in state document generation, with this PR using YAML-safe serialization to avoid \uXXXX escapes while the related PR changes the re.sub replacement mechanism.

Suggested reviewers

  • bma-d

Poem

🐰 Hops through Unicode fields so bright,
ensure_ascii=False sets things right!
No more \uXXXX escape disguise,
Round-trip safe with grateful sighs.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: fixing frontmatter scalar serialization to use UTF-8 instead of JSON escape sequences for non-ASCII values.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/state-frontmatter-unicode-roundtrip

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@augmentcode

augmentcode Bot commented Jun 7, 2026

Copy link
Copy Markdown
🤖 Augment PR Summary

Summary: This PR fixes Unicode corruption in state-document frontmatter by ensuring scalar values are written as real UTF-8 rather than JSON-style \uXXXX escapes.

Changes:

  • Adds a _yaml_value() helper that uses json.dumps(..., ensure_ascii=False) for frontmatter scalar rendering
  • Updates build-state-doc to use this helper across scalar frontmatter fields and agent config entries
  • Hardens the agentConfig block splice by switching the re.sub replacement to a callable (avoids re.error on backslash-like tokens)
  • Extends Unicode regression tests to validate scalar round-trip parsing and agentConfig splice safety with non-ASCII + backslashes

Technical Notes: This aligns serialization with the current frontmatter reader behavior (unquote_scalar strips quotes but does not JSON-decode).

🤖 Was this summary useful? React with 👍 or 👎

@augmentcode augmentcode Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

# with unquote_scalar (frontmatter.py), which strips the surrounding quotes
# WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the
# literal text "\uXXXX" and corrupt the value.
return json.dumps(value, ensure_ascii=False)

@augmentcode augmentcode Bot Jun 7, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_yaml_value() now emits raw non-ASCII characters, but cmd_build_state_doc writes via Path.write_text() without an explicit encoding, which can raise UnicodeEncodeError or corrupt output on non-UTF-8 default locales (notably some Windows setups). This makes the new behavior potentially non-portable unless the state doc is always written as UTF-8.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 605554f. write_text was safe before this PR because the output was pure ASCII (\uXXXX); emitting raw UTF-8 made the bare call locale-dependent. Now write_text(text, encoding="utf-8"), matching the repo's read_text helper. Verified the unicode tests stay green under LC_ALL=C, where the unencoded write would have raised UnicodeEncodeError.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_replacement_unicode.py (1)

130-146: ⚡ Quick win

Assert the backslash-heavy primary value survives verbatim.

This regression test currently proves the block splice does not throw, but it would still pass if agent\g<0> were omitted or rewritten during substitution. Please assert the emitted primary string too, since that is the contract this case is trying to pin down.

Suggested assertion
         text = self._build_state(config).read_text(encoding="utf-8")
+        self.assertIn(r"agent\g<0>", text)
         self.assertIn("opus — żółw", text)
         self.assertNotIn(r"\u", text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_replacement_unicode.py` around lines 130 - 146, The test
test_agent_config_special_chars_do_not_break_resub should also assert that the
backslash-heavy primary value is emitted verbatim; after building text =
self._build_state(config).read_text(...), add an assertion that the exact string
r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along
with the existing checks for "opus — żółw" and no "\u" escapes so the test
verifies the primary field survived the re.sub unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/state.py`:
- Around line 14-19: The state file is written with output_path.write_text(text)
which relies on locale and can mis-encode the non-ASCII bytes emitted by
_yaml_value(); modify the cmd_build_state_doc implementation to write the file
using explicit UTF-8 (for example encode the text with UTF-8 and write bytes or
call write_text with encoding="utf-8") so that the JSON/YAML strings from
_yaml_value() round-trip correctly; update the code path in cmd_build_state_doc
that constructs and persists text to use explicit UTF-8 output.

---

Nitpick comments:
In `@tests/test_replacement_unicode.py`:
- Around line 130-146: The test
test_agent_config_special_chars_do_not_break_resub should also assert that the
backslash-heavy primary value is emitted verbatim; after building text =
self._build_state(config).read_text(...), add an assertion that the exact string
r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along
with the existing checks for "opus — żółw" and no "\u" escapes so the test
verifies the primary field survived the re.sub unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f34e31ef-2a5e-4c92-9fc1-4b0d2ef1d782

📥 Commits

Reviewing files that changed from the base of the PR and between f332173 and 66926de.

📒 Files selected for processing (2)
  • skills/bmad-story-automator/src/story_automator/commands/state.py
  • tests/test_replacement_unicode.py

Comment thread skills/bmad-story-automator/src/story_automator/commands/state.py
…ckslash test

Addresses review feedback on the non-ASCII serialization change:

- output_path.write_text(text) had no explicit encoding. Now that the document
  carries raw UTF-8 (not \uXXXX escapes), the bare write_text would use the
  platform default locale and could raise UnicodeEncodeError / mojibake on a
  non-UTF-8 locale (Windows cp1252, POSIX C). Pass encoding="utf-8", matching
  the repo's read_text helper. (Augment medium / CodeRabbit major.)
- Strengthen the agentConfig regression test: assert the backslash value is
  emitted verbatim ("agent\\g<0>", JSON-doubled) so the test pins the value
  survives the splice, not merely that re.sub didn't raise. (CodeRabbit nitpick.)
@Pawel-N-pl

Copy link
Copy Markdown
Author

@coderabbitai re: the agentConfig test nitpick — good point that the test should pin the value, not just that re.sub didn't throw. Addressed in 605554f, but note the suggested assertIn(r"agent\g<0>", text) wouldn't pass: json.dumps doubles the backslash, so the block carries the JSON-escaped form agent\\g<0>. The added assertion uses that verbatim form (assertIn(r"agent\\g<0>", text)) so it proves the value survives the splice.

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@Pawel-N-pl — good catch on my suggestion. You're right: json.dumps escapes \\\, so the serialized text contains two backslashes, and assertIn(r"agent\g<0>", text) (one backslash) would never match. Using assertIn(r"agent\\g<0>", text) correctly asserts the JSON-escaped form that's actually present in the document. Thanks for the clear explanation and for landing the fix in 605554f.

🐇 ✨

@bma-d bma-d left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work here. I'm requesting changes on the scalar round-trip issue below because the new writer still emits JSON escapes that the existing frontmatter readers do not decode.

# with unquote_scalar (frontmatter.py), which strips the surrounding quotes
# WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the
# literal text "\uXXXX" and corrupt the value.
return json.dumps(value, ensure_ascii=False)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes raw non-ASCII output, but it still leaves the scalar writer and readers on different contracts. _yaml_value() emits JSON string escapes for backslashes, quotes, and control characters, while parse_simple_frontmatter() / unquote_scalar() only strip the surrounding quotes. Those values are persisted differently when the generated state is read back.

Repro:

PYTHONPATH=skills/bmad-story-automator/src python - <<'PY'
from story_automator.commands.state import _yaml_value
from story_automator.core.frontmatter import parse_simple_frontmatter

for value in [r"path\\to\\file", 'say "hi"', "line1\nline2", r"agent\\g<0>", "obsłuż"]:
    doc = f"---\naiCommand: {_yaml_value(value)}\n---\n"
    parsed = parse_simple_frontmatter(doc)["aiCommand"]
    print({"input": value, "serialized": _yaml_value(value), "parsed": parsed, "equal": parsed == value})
PY

The backslash, quote, and newline cases print equal: False; only the non-ASCII case prints True. Please make scalar serialization/parsing a single contract, or update the readers to decode the emitted representation, and add parity coverage for these escaped scalar cases including agentConfig values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants