fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips by Pawel-N-pl · Pull Request #39 · bmad-code-org/bmad-automator

Pawel-N-pl · 2026-06-07T08:02:30Z

What

build-state-doc writes frontmatter scalar values (epicName, customInstructions, aiCommand, agentConfig ids, …) as raw UTF-8 instead of \uXXXX escapes, so non-ASCII values survive a write→parse cycle intact. Also hardens the last remaining re.sub site (the agentConfig block splice) against re.error.

Why

The lambda repls from #18 stopped the hard crash on non-ASCII, but the values are still encoded with json.dumps' default ensure_ascii=True, i.e. as "\uXXXX". The frontmatter reader unquote_scalar (core/frontmatter.py) only strips the surrounding quotes — it does not JSON-decode — so those escapes round-trip as the literal text \uXXXX.

On a non-English project this corrupts every such scalar. Live repro on main:

FRONT> aiCommand: "claude — go"
PARSED aiCommand => 'claude \\u2014 go'        # ← literal —, not "—"
PARSED epicName  => 'upmon \\u2014 Epic 3 obs\\u0142u\\u017c'

(Lists like storyRange are unaffected — they're read via parse_string_list_literal, which does json.loads.)

Fix

Route value rendering through a small _yaml_value() helper that dumps with ensure_ascii=False, so the document carries real UTF-8 that quote-stripping recovers correctly.
Convert the agentConfig block re.sub from a raw-string repl to a callable one. Non-ASCII in an agent/model id is unlikely, but a backslash (a \g- or \b-looking token) is realistic and would raise re.error: bad escape or corrupt into a backspace — the same hazard the scalar sites already guard against.

Tests

Extended tests/test_replacement_unicode.py:

test_frontmatter_scalars_round_trip_non_ascii — parses the frontmatter and asserts the scalar equals the original. The existing tests assertIn on the whole document, which passes via the body {{token}} (rendered with str.replace) even when the frontmatter line is escaped — so they didn't catch this.
test_agent_config_special_chars_do_not_break_resub — non-ASCII model id + backslash content through the agentConfig splice.

env -u AI_AGENT npm run verify green (413 python tests). The lone test_state_policy_metadata failure without -u AI_AGENT is the known Claude-Code AI_AGENT env leak, not related to this change.

Summary by CodeRabbit

Bug Fixes
- Fixed state generation to properly preserve non-ASCII characters (Unicode) in frontmatter and configuration instead of converting them to escape sequences, ensuring improved readability for international text.
Tests
- Added comprehensive test coverage for Unicode round-trip serialization and special character handling in agent configuration.

… round-trips The lambda repls (#18) stopped build-state-doc crashing on non-ASCII, but the values are still emitted with json.dumps' default ensure_ascii=True, i.e. as "\uXXXX" escapes. The frontmatter reader (unquote_scalar) only strips the surrounding quotes — it does not JSON-decode — so those escapes survive a write->parse cycle as the literal text "\uXXXX". On a non-English project (epicName / customInstructions / aiCommand with an em-dash or Polish letters) every such scalar reads back corrupted, e.g. aiCommand "claude — go" parses as 'claude — go'. Fix: route value rendering through a small _yaml_value() helper that dumps with ensure_ascii=False, so the document carries real UTF-8 that quote-stripping recovers intact. Also convert the remaining agentConfig block re.sub from a raw-string repl to a callable one — non-ASCII in an agent/model id is unlikely, but a backslash (a "\g"- or "\b"-looking token) is realistic and would raise re.error: bad escape or corrupt into a backspace, the same hazard the scalar sites already guard against. Tests: extend test_replacement_unicode with frontmatter round-trip assertions (the existing tests assertIn on the whole document, which passes via the body {{token}} even when the frontmatter line is escaped) and an agentConfig case covering non-ASCII + backslash content.

coderabbitai · 2026-06-07T08:02:40Z

Warning

Review limit reached

@Pawel-N-pl, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 50 minutes and 30 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1af4eda4-cb5f-44ae-923d-a11894c3d112

📥 Commits

Reviewing files that changed from the base of the PR and between 66926de and 605554f.

📒 Files selected for processing (2)

skills/bmad-story-automator/src/story_automator/commands/state.py
tests/test_replacement_unicode.py

Walkthrough

This PR adds Unicode-safe YAML serialization to state document generation. A new _yaml_value helper prevents ASCII-only escaping, and this helper replaces json.dumps calls throughout config field serialization in cmd_build_state_doc. Tests verify non-ASCII values and backslash patterns survive round-trip parsing without corruption.

Changes

Unicode-safe YAML Serialization

Layer / File(s)	Summary
YAML serialization helper `skills/bmad-story-automator/src/story_automator/commands/state.py`	Introduces `_yaml_value(value: Any) -> str` using `json.dumps(..., ensure_ascii=False)` to serialize values without `\uXXXX` escape sequences, preserving readability for non-ASCII and backslash-containing strings in YAML-formatted text.
State document generation updates `skills/bmad-story-automator/src/story_automator/commands/state.py`	Replaces `json.dumps` with `_yaml_value` across five callsites in `cmd_build_state_doc`: `customInstructions`, `agentConfig` default fields (`defaultPrimary`, `defaultFallback`, `defaultModel`), `perTask` overrides, `complexityOverrides` entries, and the final replacement loop.
Unicode and special character test coverage `tests/test_replacement_unicode.py`	Adds import for `parse_simple_frontmatter` and two new regression test cases: one validating non-ASCII frontmatter scalars round-trip through parsing without escaping, another verifying `agentConfig` content with backslash patterns (e.g., `\g<0>`) remains intact without regex errors or literal escape sequences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

bmad-code-org/bmad-automator#7: Both PRs address non-ASCII and backslash handling in state document generation, with this PR using YAML-safe serialization to avoid \uXXXX escapes while the related PR changes the re.sub replacement mechanism.

Suggested reviewers

bma-d

Poem

🐰 Hops through Unicode fields so bright,
ensure_ascii=False sets things right!
No more \uXXXX escape disguise,
Round-trip safe with grateful sighs. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary change: fixing frontmatter scalar serialization to use UTF-8 instead of JSON escape sequences for non-ASCII values.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/state-frontmatter-unicode-roundtrip

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

augmentcode · 2026-06-07T08:05:43Z

🤖 Augment PR Summary

Summary: This PR fixes Unicode corruption in state-document frontmatter by ensuring scalar values are written as real UTF-8 rather than JSON-style \uXXXX escapes.

Changes:

Adds a _yaml_value() helper that uses json.dumps(..., ensure_ascii=False) for frontmatter scalar rendering
Updates build-state-doc to use this helper across scalar frontmatter fields and agent config entries
Hardens the agentConfig block splice by switching the re.sub replacement to a callable (avoids re.error on backslash-like tokens)
Extends Unicode regression tests to validate scalar round-trip parsing and agentConfig splice safety with non-ASCII + backslashes

Technical Notes: This aligns serialization with the current frontmatter reader behavior (unquote_scalar strips quotes but does not JSON-decode).

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-06-07T08:05:44Z

+    # with unquote_scalar (frontmatter.py), which strips the surrounding quotes
+    # WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the
+    # literal text "\uXXXX" and corrupt the value.
+    return json.dumps(value, ensure_ascii=False)


_yaml_value() now emits raw non-ASCII characters, but cmd_build_state_doc writes via Path.write_text() without an explicit encoding, which can raise UnicodeEncodeError or corrupt output on non-UTF-8 default locales (notably some Windows setups). This makes the new behavior potentially non-portable unless the state doc is always written as UTF-8.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

Good catch — fixed in 605554f. write_text was safe before this PR because the output was pure ASCII (\uXXXX); emitting raw UTF-8 made the bare call locale-dependent. Now write_text(text, encoding="utf-8"), matching the repo's read_text helper. Verified the unicode tests stay green under LC_ALL=C, where the unencoded write would have raised UnicodeEncodeError.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/test_replacement_unicode.py (1)
130-146: ⚡ Quick win

Assert the backslash-heavy primary value survives verbatim.

This regression test currently proves the block splice does not throw, but it would still pass if agent\g<0> were omitted or rewritten during substitution. Please assert the emitted primary string too, since that is the contract this case is trying to pin down.
Suggested assertion
         text = self._build_state(config).read_text(encoding="utf-8")
+        self.assertIn(r"agent\g<0>", text)
         self.assertIn("opus — żółw", text)
         self.assertNotIn(r"\u", text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_replacement_unicode.py` around lines 130 - 146, The test
test_agent_config_special_chars_do_not_break_resub should also assert that the
backslash-heavy primary value is emitted verbatim; after building text =
self._build_state(config).read_text(...), add an assertion that the exact string
r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along
with the existing checks for "opus — żółw" and no "\u" escapes so the test
verifies the primary field survived the re.sub unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/state.py`:
- Around line 14-19: The state file is written with output_path.write_text(text)
which relies on locale and can mis-encode the non-ASCII bytes emitted by
_yaml_value(); modify the cmd_build_state_doc implementation to write the file
using explicit UTF-8 (for example encode the text with UTF-8 and write bytes or
call write_text with encoding="utf-8") so that the JSON/YAML strings from
_yaml_value() round-trip correctly; update the code path in cmd_build_state_doc
that constructs and persists text to use explicit UTF-8 output.

---

Nitpick comments:
In `@tests/test_replacement_unicode.py`:
- Around line 130-146: The test
test_agent_config_special_chars_do_not_break_resub should also assert that the
backslash-heavy primary value is emitted verbatim; after building text =
self._build_state(config).read_text(...), add an assertion that the exact string
r"agent\g<0>" (the perTask["dev"]["primary"] value) is present in text along
with the existing checks for "opus — żółw" and no "\u" escapes so the test
verifies the primary field survived the re.sub unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f34e31ef-2a5e-4c92-9fc1-4b0d2ef1d782

📥 Commits

Reviewing files that changed from the base of the PR and between f332173 and 66926de.

📒 Files selected for processing (2)

skills/bmad-story-automator/src/story_automator/commands/state.py
tests/test_replacement_unicode.py

…ckslash test Addresses review feedback on the non-ASCII serialization change: - output_path.write_text(text) had no explicit encoding. Now that the document carries raw UTF-8 (not \uXXXX escapes), the bare write_text would use the platform default locale and could raise UnicodeEncodeError / mojibake on a non-UTF-8 locale (Windows cp1252, POSIX C). Pass encoding="utf-8", matching the repo's read_text helper. (Augment medium / CodeRabbit major.) - Strengthen the agentConfig regression test: assert the backslash value is emitted verbatim ("agent\\g<0>", JSON-doubled) so the test pins the value survives the splice, not merely that re.sub didn't raise. (CodeRabbit nitpick.)

Pawel-N-pl · 2026-06-07T08:12:37Z

@coderabbitai re: the agentConfig test nitpick — good point that the test should pin the value, not just that re.sub didn't throw. Addressed in 605554f, but note the suggested assertIn(r"agent\g<0>", text) wouldn't pass: json.dumps doubles the backslash, so the block carries the JSON-escaped form agent\\g<0>. The added assertion uses that verbatim form (assertIn(r"agent\\g<0>", text)) so it proves the value survives the splice.

coderabbitai · 2026-06-07T08:13:08Z

Tip

For best results, initiate chat on the files or code changes.

@Pawel-N-pl — good catch on my suggestion. You're right: json.dumps escapes \ → \\, so the serialized text contains two backslashes, and assertIn(r"agent\g<0>", text) (one backslash) would never match. Using assertIn(r"agent\\g<0>", text) correctly asserts the JSON-escaped form that's actually present in the document. Thanks for the clear explanation and for landing the fix in 605554f.

🐇 ✨

bma-d

Thanks for the work here. I'm requesting changes on the scalar round-trip issue below because the new writer still emits JSON escapes that the existing frontmatter readers do not decode.

bma-d · 2026-06-12T07:35:42Z

+    # with unquote_scalar (frontmatter.py), which strips the surrounding quotes
+    # WITHOUT JSON-decoding, so an ensure_ascii escape would round-trip as the
+    # literal text "\uXXXX" and corrupt the value.
+    return json.dumps(value, ensure_ascii=False)


This fixes raw non-ASCII output, but it still leaves the scalar writer and readers on different contracts. _yaml_value() emits JSON string escapes for backslashes, quotes, and control characters, while parse_simple_frontmatter() / unquote_scalar() only strip the surrounding quotes. Those values are persisted differently when the generated state is read back.

Repro:

PYTHONPATH=skills/bmad-story-automator/src python - <<'PY' from story_automator.commands.state import _yaml_value from story_automator.core.frontmatter import parse_simple_frontmatter for value in [r"path\\to\\file", 'say "hi"', "line1\nline2", r"agent\\g<0>", "obsłuż"]: doc = f"---\naiCommand: {_yaml_value(value)}\n---\n" parsed = parse_simple_frontmatter(doc)["aiCommand"] print({"input": value, "serialized": _yaml_value(value), "parsed": parsed, "equal": parsed == value}) PY

The backslash, quote, and newline cases print equal: False; only the non-ASCII case prints True. Please make scalar serialization/parsing a single contract, or update the readers to decode the emitted representation, and add parity coverage for these escaped scalar cases including agentConfig values.

fix(story-automator): address maintainer review on bmad-code-org#38/bmad-code-org#39/bmad-code-org#41/bmad-code-org#42

Pawel-N-pl requested a review from bma-d as a code owner June 7, 2026 08:02

augmentcode Bot reviewed Jun 7, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread skills/bmad-story-automator/src/story_automator/commands/state.py

bma-d requested changes Jun 12, 2026

View reviewed changes

Pawel-N-pl mentioned this pull request Jun 12, 2026

fix(story-automator): address maintainer review on #38/#39/#41/#42 Pawel-N-pl/bmad-automator#5

Merged

Pawel-N-pl added a commit to Pawel-N-pl/bmad-automator that referenced this pull request Jun 12, 2026

Merge pull request #5 from Pawel-N-pl/fix/brian-review-followups

cd7b494

fix(story-automator): address maintainer review on bmad-code-org#38/bmad-code-org#39/bmad-code-org#41/bmad-code-org#42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39

fix(build-state-doc): write frontmatter scalars as UTF-8 so non-ASCII round-trips#39
Pawel-N-pl wants to merge 2 commits into
mainfrom
fix/state-frontmatter-unicode-roundtrip

Pawel-N-pl commented Jun 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading

Review limit reached

❌ Failed checks (1 warning)

Uh oh!

augmentcode Bot commented Jun 7, 2026

Uh oh!

augmentcode Bot left a comment

Uh oh!

augmentcode Bot Jun 7, 2026 •

edited

Loading

Uh oh!

Pawel-N-pl Jun 7, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Pawel-N-pl commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026

Uh oh!

bma-d left a comment

Uh oh!

bma-d Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Pawel-N-pl commented Jun 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Fix

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

augmentcode Bot commented Jun 7, 2026

Uh oh!

augmentcode Bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pawel-N-pl Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Pawel-N-pl commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026

Uh oh!

bma-d left a comment

Choose a reason for hiding this comment

Uh oh!

bma-d Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Pawel-N-pl commented Jun 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading

augmentcode Bot Jun 7, 2026 •

edited

Loading