diff --git a/docs/decompiler/RAII_MIGRATION.md b/docs/decompiler/RAII_MIGRATION.md index a60fcd89e54..0d618efa1b9 100644 --- a/docs/decompiler/RAII_MIGRATION.md +++ b/docs/decompiler/RAII_MIGRATION.md @@ -175,7 +175,7 @@ can land at C++11 and C++20 later. |---|---|---| | #31-1 | This plan | shipped | | #31-2 | Stage 1 — `address.cc`, `space.cc`, `rangeutil.cc` (audit's "range.cc" target) | shipped: the three files were already raw-`new`-free in tree; the `gradle cppRaiiAudit` per-file gate (see [`gradle/cppRaiiAudit.gradle`](../../gradle/cppRaiiAudit.gradle)) was added to fail CI on any regression. Wired into `.github/workflows/build-ghidra.yml` alongside `ignoreAudit` and `objectInputStreamAudit`. | -| #31-3 | Stage 2 — `marshal.cc`, `xml.cc` | open | +| #31-3 | Stage 2 — `marshal.cc`, `xml.cc` | partial: marshal shipped ([PR #46](https://github.com/CryptoJones/GayHydra/pull/46)), xml.y/xml.cc in flight ([PR #51](https://github.com/CryptoJones/GayHydra/pull/51) + [PR #52](https://github.com/CryptoJones/GayHydra/pull/52)); semantic-action sites scoped in [RAII_STAGE_2C_XML.md](RAII_STAGE_2C_XML.md) | | #31-4 | Stage 3 — `database.cc`, `comment.cc`, `cover.cc` | open | | #31-5 | Stage 4 — `type.cc`, `userop.cc` | open | | #31-6 | Stage 5 — pcode core | open | diff --git a/docs/decompiler/RAII_STAGE_2C_XML.md b/docs/decompiler/RAII_STAGE_2C_XML.md new file mode 100644 index 00000000000..5a553afecf1 --- /dev/null +++ b/docs/decompiler/RAII_STAGE_2C_XML.md @@ -0,0 +1,121 @@ +# Rec 31 Stage 2C — `xml.y` / `xml.cc` semantic-action RAII migration + +*Status: design / scoping. No code change in this PR; describes the work that Stage 2C entails so a future PR can be scoped accurately.* + +## What's left after Stage 2B + +PRs [#51](https://github.com/CryptoJones/GayHydra/pull/51) (`XmlScan::lvalue`) and [#52](https://github.com/CryptoJones/GayHydra/pull/52) (`xml_parse::global_scan` lifetime) converted the *epilogue*-section raw `new`s in `xml.y` / `xml.cc` — the hand-written C++ that bison passes through verbatim. The migration was minimal-disruption because the edits were entirely outside the bison-generated table-driven `yyparse` body. + +The remaining raw `new` sites live **inside bison semantic actions**: + + xml.y:150 $$ = new string; // attsinglemid + xml.y:153 $$ = new string; // attdoublemid + xml.y:198 $$ = new Attributes($2); // stagstart + xml.y:200 $$ = new NameValue; // SAttribute + xml.y:208 string *tmp = new string(); ... + (Reference; tmp owned locally, deleted before return) + xml.y:538 Element *newel = new Element(cur); // TreeHandler::startElement + xml.y:624 Document *doc = new Document(); // xml_tree() + +These get inlined into `yyparse`'s big switch statement when bison compiles the grammar. The corresponding parallel sites in `xml.cc` are: + + xml.cc:1598 (yyval.str) = new string; + xml.cc:1616 (yyval.str) = new string; + xml.cc:1736 (yyval.attr) = new Attributes((yyvsp[0].str)); + xml.cc:1748 (yyval.pair) = new NameValue; ... + xml.cc:1790 string *tmp = new string(); ... + +## Why this isn't a hand-edit + +Stages 2A (marshal) and 2B (xml.y epilogue) succeeded with the *parallel-hand-edit* technique: change `xml.y` and `xml.cc` consistently in the regions bison copies verbatim, no bison regeneration required. + +The semantic-action sites can't be hand-edited safely because: + +1. `xml.cc:1598` etc. live inside the bison-generated `yyparse()` body — a state machine driven by the `yytable` / `yydefact` / etc. arrays that bison computes from the grammar. Hand-edits there are robust as long as the surrounding code shape doesn't change. + +2. **But the semantic actions cross the bison `%union` boundary.** The `%union` declaration in `xml.y` is: + + %union { + int4 i; + string *str; + Attributes *attr; + NameValue *pair; + } + + It's a C-style union of raw pointers. The RAII candidate replacements (`unique_ptr`, `unique_ptr`, etc.) **cannot live in a C `union`** — unions require trivially-constructible / trivially-destructible members, which `unique_ptr` is not. + +3. The grammar also declares destructors: + + %destructor { delete $$; } <*> + + bison generates calls to this whenever the parser discards a semantic value on its stack (e.g. during error recovery). If we change the value type to a smart pointer, the destructor body changes shape too. + +So Stage 2C is **not** "change `new` to `make_unique` in five places and call it done." The grammar's value-type model itself has to change. + +## Approach options + +### Option A — switch to `%define api.value.type variant` (bison's C++ variants) + +Bison 3.0 introduced an opt-in C++ tagged-variant value type: + + %language "c++" + %require "3.0" + %define api.value.type variant + %token > CHARDATA CDATA ATTVALUE ... + %type > AttValue attsinglemid attdoublemid ... + %type > EmptyElemTag STag stagstart + %type > SAttribute + +This *does* support non-trivially-destructible value types. The generated parser: + +- becomes a C++ class (`xml::parser` instead of a free `yyparse` function); +- uses move semantics on the value stack; +- has the destructors automatically called when a value is popped — no `%destructor { delete $$; }` needed. + +**Pros:** clean RAII end-to-end. Eliminates manual `delete` in actions (`delete $2; *$$ += *$2;` becomes `*$$ += std::move(*$2);` etc.). The bison maintainers' recommended path for new C++ grammars. + +**Cons:** large API break in the generated code. `xml_parse()`'s body has to be rewritten: `int res = yyparse();` becomes something like `xml::parser p; int res = p.parse();`. `yylex` signature changes — instead of writing to global `yylval`, the lexer fills a parser-supplied `xml::parser::symbol_type`. `yyerror` becomes a member. The diff against the current `xml.cc` would be wholesale rewrite (~2700 of the ~2700 lines). + +The wholesale rewrite is the right move long-term but requires: +- Pin the bison version (3.0.5 is locally buildable; 3.0.4 isn't on modern glibc). Document it in the Makefile + CONTRIBUTING. +- Write a from-scratch xml_parse shim that bridges the new C++ parser API to the existing `xml_parse(istream &, ContentHandler *, int4)` interface that decompiler callers depend on. +- Verify the bison-regenerated `xml.cc` produces the same parse trees on the existing XML corpus. + +### Option B — keep `%union` of raw pointers; manage ownership tighter at the boundary + +Leave the `%union` as-is (raw pointers everywhere on the parser value stack), but: + +- Audit every semantic action that calls `new X(...)` and ensure the corresponding "ownership transfer in" / "consumer deletes" / "drop on error" rules are explicit. +- Convert the `string *tmp = new string(); ... delete tmp;` pattern in line 208 to a stack-local `string tmp;` (no `new` needed, no `delete` needed — it's used once within the action then goes out of scope). +- Accept that the `%destructor { delete $$; }` is the project's RAII story for the parser value stack. +- Add `xml.y` / `xml.cc` to `cppRaiiAudit`'s `PROTECTED_FILES` once the obvious sites (208's `tmp`) are cleaned, with the understanding that the `%union` raw-pointer pattern is grandfathered as bison's contract, not user-level raw ownership. + +**Pros:** small, mechanical, no bison-version dance, no API break. Lines 208 (the `tmp`) is the only one that's an unambiguous code-smell; the rest are bison value-stack convention. + +**Cons:** doesn't actually achieve "no raw `new` in `xml.cc`" — it just decides that some raw-`new`s are OK because bison wrote them. The audit gate has to carry an explicit exception for the `%union`-mediated sites. Slightly weakens the Rec 31 invariant. + +## Recommendation + +**Option B for the next PR; Option A as a separate strategic project.** + +The `string *tmp = new string(); ... delete tmp;` at line 208 is a clear win — it's a stack-local that the author wrote as heap-allocated for no apparent reason. Convert it. The other five semantic-action `new`s are bison's value-stack contract; leaving them as raw pointers and adding `xml.y` / `xml.cc` to `cppRaiiAudit`'s `PROTECTED_FILES` with a documented exception is honest and shipping-ready. + +Option A is the right *eventual* destination but its scope is "rewrite the xml parser front-end." That needs its own sprint, its own design review with the decompiler maintainer, its own corpus-equivalence verification. Not something to slip into a multi-PR thread. + +## Concrete next actions (in order) + +1. **PR — Stage 2C-min (Option B):** convert `xml.y:208`'s `string *tmp = new string(); ... delete tmp;` to a stack-local; mirror in `xml.cc:1790`. Add `xml.y` / `xml.cc` to `cppRaiiAudit`'s `PROTECTED_FILES` with an exception comment for the bison `%union` semantic-action sites. After this, `cppRaiiAudit` would gate any *new* raw `new` introduced in those files but accepts the five existing bison-mediated sites. + +2. **PR — Element parse-tree ownership:** the `xml.y:538` `new Element(cur)` is parse-tree owned by the parent's `children` vector. Convert `Element::children` to `vector>` and update the few callers. This is independent of the `%union` question. + +3. **PR — Document return:** `xml.y:624` `new Document()` flows out through `xml_tree()` to `XmlDecode::ingestStream` where `XmlDecode::~XmlDecode` does `delete document;`. Convert `xml_tree` to return `unique_ptr` and `XmlDecode::document` to `unique_ptr`. Drop the manual destructor. + +4. **Strategic sprint — Option A:** switch `xml.y` to `%define api.value.type variant`, regenerate `xml.cc` with bison-3.0.5, write the C++-parser shim. Coordinate with decompiler maintainer. + +## Bison version note + +Local builds of bison-3.0.4 fail on modern glibc due to a gnulib `fseterr.c` portability bug. **Bison-3.0.5** builds cleanly (verified in this session at `/tmp/bison30/bin/bison`) and the diff against `xml.cc` (which was generated by 3.0.4) is ~615 lines, almost all cosmetic (`#line` directives, version-string bumps, a `default:` case added in 3.0.5). If a regeneration ever lands, pin **bison-3.0.5** in `decompile/cpp/Makefile` and a `CONTRIBUTING` note. + +--- + +*Proudly Made in Nebraska. Go Big Red! 🌽 *