feat: adopt @signalwire/docusaurus-plugin-llms-txt (Phase 1) #497
Implements Phase 1 of the docs-driven skill generation plan (lives in the HarperFast/skills repo at docs/plans/docs-driven-skills.md). Adds a postBuild step that converts every rendered HTML page to a flat-markdown sibling (foo.html → foo.md) and emits llms.txt / llms-full.txt index files at the build root.

Why this plugin specifically: it operates on Docusaurus's postBuild route data, which captures routes from every registered docs plugin instance (learn, reference, fabric, release-notes) uniformly. The HTML→markdown conversion via `unified` means MDX components, theme imports, custom React components, and build-time data are all already resolved before we see them, so no per-component handlers or module shims are required. See facebook/docusaurus#10899 for the broader community context; this plugin (by SignalWire) is the de facto choice.

Changes:
- docusaurus.config.ts: register `@signalwire/docusaurus-plugin-llms-txt` with `enableLlmsFullTxt: true`. Defaults handle everything else: all four docs plugin instances are picked up automatically, and the default contentSelectors work for the classic theme. No excludeRoutes: we emit flat markdown for the full site (including v4 reference docs) so the artifacts are useful to any consumer; the skills repo's manifest decides which routes to actually use.
- .github/workflows/deploy.yaml: add a verification step after the build that fails the workflow if any of the four docs instances produced zero .md files, or if llms.txt / llms-full.txt are missing. Catches plugin regressions before they reach the deployed site.
- package.json: add @signalwire/docusaurus-plugin-llms-txt as a devDependency.

Verified locally: `npm run build` produces 371 .md files across all four docs instances (learn: 10, reference: 137, fabric: 10, release-notes: 211) plus llms.txt (385 lines) and llms-full.txt (36913 lines).

Spot-checked a v5 reference page, a learn MDX page using imported components, a fabric page, and a release-notes page; content renders correctly through the HTML→MD round-trip. The plugin reports one warning about /reference (the empty top-level reference index page that has no extractable content), which is expected and benign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🚀 Preview Deployment
Your preview deployment is ready!
🔗 Preview URL: https://preview.harper-documentation.harperfabric.com/pr-497
This preview will update automatically when you push new commits.
kriszyp left a comment
Awesome. This is actually really nice.
Let's jump right to the hard stuff though, https://preview.harper-documentation.harperfabric.com/pr-497/learn/getting-started/create-your-first-application.md:
So the way it does tabs is like this: Local Installation and Fabric are the tabs; it lists both tab labels and then outputs their contents sequentially. I am not sure that is ideal. It seems like the `* Fabric` header should go after the Local Installation section:
* Local Installation
* Fabric
Get started by cloning the [`HarperFast/create-your-first-application`](https://github.com/HarperFast/create-your-first-application) repo and opening it in your editor of choice. If you have installed Harper using a container, make sure to clone into the `dev/` directory that the container was mounted to.
```
git clone https://github.com/HarperFast/create-your-first-application.git first-harper-app
```
From the "Cluster" page, navigate to the "Applications" tab and click on "New Application" on the left-hand sidebar.
Give the application a name such as "first-harper-app", then click on the "Import" tab.
Specify `https://github.com/HarperFast/create-your-first-application` in the "Git Repository URL" field.
Keep the "Install Command" empty and the "Authorization" as "Public Access".
Finally, click the "Import Application" button and wait for the application to be instantiated.
That being said, this is impressively clean. Much better than I expected. I think there are tweaks we should make, but I approve moving forward with this.
Okay great, I'll see if we can iterate on the output now before moving to Phase 2 of the skill update plan.
The default HTML→Markdown conversion in @signalwire/docusaurus-plugin-llms-txt produces noisy or misleading output for several Docusaurus-specific constructs. This commit adds a small rehype plugin (scripts/rehype-docusaurus-to-llms.mjs) wired into the plugin's `beforeDefaultRehypePlugins` chain to normalize them before the HTML→MD conversion runs.

Three transforms:

1. Tabs (<div class="tabs-container">) → sequential `#### h4` subsections. Previously the default conversion stacked tab labels as a bullet list followed by all panel contents concatenated together, making it read as if every tab's content applied under every label. Now each tab is a properly labeled subsection containing its own content. 245 subsections produced across the site by this transform.
2. Hash-link anchors (<a class="hash-link">) → removed. These are Docusaurus's "direct link to this heading" UI affordances with no semantic value to an LLM. Default output included `[](#anchor "Direct link to ...")` noise next to every heading; now gone.
3. Version badges (<span class="badge_*">Added in<!-- -->: ...</span>) → clean italic text. The React render leaves empty comment markers between text fragments which the default conversion preserved as literal `<!-- -->` strings. Now emits `*Added in: v4.2.0*`.

All three constructs are uniformly handled across all four docs plugin instances (learn, reference, fabric, release-notes) because the rehype pass runs at the HTML stage where every page's content looks the same regardless of its source MDX.

Spot-checked output before/after on build/learn/getting-started/create-your-first-application.md (heavy tabs usage) and build/reference/v5/rest/overview.md (VersionBadge); both now read cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
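A minimal sketch of how the tabs transform described in this commit might work. It is not the code in scripts/rehype-docusaurus-to-llms.mjs; it operates on plain hast-shaped objects, and it assumes tab labels carry `role="tab"` and panels carry `role="tabpanel"` inside the `tabs-container` div. The helper names (`el`, `text`, `textOf`, `findAll`, `rewriteTabs`) are hypothetical.

```javascript
// hast-shaped helpers: elements are { type: 'element', tagName, properties, children },
// text nodes are { type: 'text', value }.
const el = (tagName, properties, children) => ({ type: 'element', tagName, properties, children });
const text = (value) => ({ type: 'text', value });

// Concatenate all descendant text of a node.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

// Collect matching descendant elements in document order, without descending into a match.
function findAll(node, predicate, found = []) {
  for (const child of node.children || []) {
    if (child.type === 'element' && predicate(child)) found.push(child);
    else findAll(child, predicate, found);
  }
  return found;
}

const hasClass = (node, name) => ((node.properties || {}).className || []).includes(name);
const hasRole = (role) => (node) => (node.properties || {}).role === role;

// Replace every tabs container with "h4 label + panel content" pairs, innermost first.
// The role attributes checked here are an assumption about the rendered markup.
function rewriteTabs(node) {
  if (!node.children) return node;
  const out = [];
  for (const child of node.children) {
    rewriteTabs(child);
    if (child.type === 'element' && hasClass(child, 'tabs-container')) {
      const labels = findAll(child, hasRole('tab')).map((n) => textOf(n).trim());
      const panels = findAll(child, hasRole('tabpanel'));
      panels.forEach((panel, i) => {
        out.push(el('h4', {}, [text(labels[i] || `Tab ${i + 1}`)]));
        out.push(...panel.children);
      });
    } else {
      out.push(child);
    }
  }
  node.children = out;
  return node;
}

// Example input shaped like the page discussed in the review.
const tree = el('div', {}, [
  el('div', { className: ['tabs-container'] }, [
    el('ul', { role: 'tablist' }, [
      el('li', { role: 'tab' }, [text('Local Installation')]),
      el('li', { role: 'tab' }, [text('Fabric')]),
    ]),
    el('div', {}, [
      el('div', { role: 'tabpanel' }, [el('p', {}, [text('Clone the repo.')])]),
      el('div', { role: 'tabpanel' }, [el('p', {}, [text('Import the application.')])]),
    ]),
  ]),
]);
const outline = rewriteTabs(tree).children.map((n) => `${n.tagName}:${textOf(n)}`);
```

Each label now sits directly above its own panel content, which is the ordering the review comment asked for.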
Update: addressed the cosmetic issues called out in the original PR description
Added a small custom rehype plugin (
These were called out as "known minor cosmetic issues" in the original PR description. They're addressable from the HTML side because Docusaurus emits very consistent, identifiable markup for each construct. The plugin is small (~150 lines, well-commented) and lives at
Verified end-to-end: same 371
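To illustrate why consistent markup makes these fixes small, here is a sketch of the hash-link removal on a hast-shaped tree. It is an illustration under assumptions, not the plugin's actual code: `removeHashLinks` and the sample heading are hypothetical, and the only markup detail taken from this PR is the `hash-link` class on the anchor.

```javascript
// hast-shaped helpers (plain objects, no library needed for the sketch).
const el = (tagName, properties, children) => ({ type: 'element', tagName, properties, children });
const text = (value) => ({ type: 'text', value });

// Concatenate all descendant text of a node.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

// True for <a class="hash-link">, the "direct link to this heading" affordance.
function isHashLink(node) {
  const classes = (node.properties || {}).className || [];
  return node.type === 'element' && node.tagName === 'a' && classes.includes('hash-link');
}

// Drop every hash-link anchor anywhere in the tree, mutating it in place.
function removeHashLinks(node) {
  if (!node.children) return node;
  node.children = node.children.filter((child) => !isHashLink(child));
  node.children.forEach(removeHashLinks);
  return node;
}

// Hypothetical heading as rendered: the text followed by the anchor.
const heading = el('h2', { id: 'overview' }, [
  text('Overview'),
  el('a', { className: ['hash-link'], href: '#overview', title: 'Direct link to Overview' }, []),
]);
const section = removeHashLinks(el('div', {}, [heading]));
```

After the pass, the heading contains only its text, so the converter has nothing to turn into an empty `[](#anchor ...)` link.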
Two unrelated warnings have been firing during every docs build since Phase 1 (#497) landed. Both are now fixed; an unrelated third warning ("Excluded N routes by current config") remains but is just accurate accounting of intentional exclusions.

1. /reference route processing failure

src/pages/reference/index.tsx is a client-side React redirect component that returns null. Its rendered HTML is an empty Docusaurus shell with no extractable content. The llms-txt plugin was attempting to process it and emitting a warning ("Route Error: Failed to process route '/reference': Failed to convert HTML to Markdown: HTML to Markdown conversion resulted in empty content").

Fix: add '/reference' to the plugin's excludeRoutes config. This is not a real content page; it's a routing redirect that should never be considered for the flat-markdown export.

2. Broken anchor in resource-api.md

reference/resources/resource-api.md line 799 linked to `../components/javascript-environment.md#transaction`, but the heading `### \`transaction(fn)\`` in that target file gets the auto-generated anchor `transactionfn` (parens stripped, dash dropped). The link wanted the semantic name `transaction`, not the auto-id.

Fix: add an explicit anchor `{#transaction}` to the heading in javascript-environment.md. This gives the link a stable, human-meaningful target that survives heading-text changes and matches the author's original intent. Docusaurus 3.x supports the {#id} syntax natively.

Verified: rebuild produces the same 371 documents, both warnings are gone, and grep confirms the rendered HTML now has id="transaction" directly on the heading.

The remaining "Excluded 2 routes by current config" warning is the plugin reporting that two routes were filtered: the home page `/` (a content-pages route, excluded by default via includePages: false) and `/reference` (our new explicit exclusion). Both are intentional; the warning is accurate signal, not noise.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
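The anchor mismatch in fix 2 can be reproduced with a rough approximation of heading-id generation. This is a sketch only: `approximateSlug` and `headingId` are hypothetical names, and the character rules are a simplification rather than the exact algorithm Docusaurus uses.

```javascript
// Approximate an auto-generated heading id: lowercase, drop punctuation
// (backticks, parentheses, ...), turn whitespace into hyphens.
// This is a simplification, not the real slugger.
function approximateSlug(headingText) {
  return headingText
    .toLowerCase()
    .trim()
    .replace(/[^\p{L}\p{N}\s_-]/gu, '')
    .replace(/\s/g, '-');
}

// Prefer an explicit trailing {#id} when the heading has one,
// otherwise fall back to the approximated auto id.
function headingId(rawHeading) {
  const explicit = /\s*\{#([\w-]+)\}\s*$/.exec(rawHeading);
  return explicit ? explicit[1] : approximateSlug(rawHeading);
}

const autoId = headingId('`transaction(fn)`');
const explicitId = headingId('`transaction(fn)` {#transaction}');
```

Here `autoId` comes out as `transactionfn` and `explicitId` as `transaction`, which is why the link to `#transaction` only resolves once the explicit id is added.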
Phase 1 of the docs-driven skill generation plan
Adopts `@signalwire/docusaurus-plugin-llms-txt` so every rendered docs page also publishes a flat-markdown sibling at the same path (e.g. `/reference/v5/rest/overview` → `/reference/v5/rest/overview.md`), plus `llms.txt` and `llms-full.txt` index files at the build root.

The plan that drives this work lives in HarperFast/skills → Migration → Phase 1. This PR is the docs-repo side; the skills-repo side (Phase 2: the generator + workflow that consumes this output) is separate.

Why this plugin
It operates on Docusaurus's `postBuild` route data, which surfaces routes from every registered docs plugin instance (`learn`, `reference` with both v5 and v4, `fabric`, `release-notes`) uniformly. The HTML→markdown conversion via `unified` runs after the build, so MDX components, theme imports, custom React components (`<VersionBadge>`, `<LatestPatchLink>`, etc.), build-time data, and partial inclusions are all already resolved by the time the plugin sees them. No per-component handlers needed.

Community context: facebook/docusaurus#10899. This plugin (by SignalWire) is the de facto community choice, ~19k weekly downloads, MIT-licensed.

We previously tested the alternative `docusaurus-plugin-llms` (by rachfop); it only picked up one of our four docs plugin instances, which is a hard incompatibility with how this site is configured.

What's in this PR
docusaurus.config.ts
Register the plugin with minimal config; defaults handle most things sensibly:
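A minimal sketch of what such a registration might look like. The plugin name and the `enableLlmsFullTxt` option come from this PR; placing the option at the top level of the options object is an assumption made for brevity, and the plugin may expect it under a nested key depending on the installed version, so check its README before copying. In a real config file the `config` object would be the default export.

```javascript
// Sketch of the plugin entry. Option placement is an assumption (see above).
const llmsTxtPlugin = [
  '@signalwire/docusaurus-plugin-llms-txt',
  {
    // Also emit the bundled llms-full.txt next to the per-page .md files.
    enableLlmsFullTxt: true,
  },
];

const config = {
  // Site title, url, presets, and the four docs plugin instances are omitted here.
  plugins: [llmsTxtPlugin],
};
```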
Notes:
- No `excludeRoutes`: we emit flat markdown for the full site (including v4 reference docs) so the artifacts are useful to any consumer. The skills repo's manifest decides which routes to actually use.
- Default `contentSelectors` work for the classic theme; no overrides needed.
- `enableLlmsFullTxt: true` produces the bundled `llms-full.txt` alongside the per-page files. Useful for LLM tools that want one file.

.github/workflows/deploy.yaml
Adds a verification step after the build that fails the workflow if any of the four docs instances produced zero `.md` files, or if `llms.txt` / `llms-full.txt` are missing. Catches plugin regressions before they reach the deployed site.

package.json
Adds `@signalwire/docusaurus-plugin-llms-txt` as a devDependency.

Verification
`npm run build` locally produces:
- `build/learn/**/*.md`
- `build/reference/**/*.md`
- `build/fabric/**/*.md`
- `build/release-notes/**/*.md`
- `build/llms.txt`
- `build/llms-full.txt`

Spot-checked content across all four docs instances; pages render correctly through the HTML→MD round-trip. Tables, code blocks, headings, internal links all preserved.
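The verification rule can be sketched as a small Node script. This is an illustration of the check described above, not the step that was added to deploy.yaml; `countMarkdown` and `verifyBuild` are hypothetical names, and the directory names are the four docs instances listed in this PR.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Count files ending in .md under dir, recursively. A missing dir counts as 0.
function countMarkdown(dir) {
  if (!fs.existsSync(dir)) return 0;
  let count = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) count += countMarkdown(full);
    else if (entry.name.endsWith('.md')) count += 1;
  }
  return count;
}

// Return a list of problems; an empty list means the build output looks complete.
function verifyBuild(buildDir) {
  const problems = [];
  for (const instance of ['learn', 'reference', 'fabric', 'release-notes']) {
    if (countMarkdown(path.join(buildDir, instance)) === 0) {
      problems.push(`no .md files under ${instance}/`);
    }
  }
  for (const file of ['llms.txt', 'llms-full.txt']) {
    if (!fs.existsSync(path.join(buildDir, file))) problems.push(`missing ${file}`);
  }
  return problems;
}
```

A CI step could call `verifyBuild('build')` after the build and exit non-zero when the returned list is not empty, so a silent plugin regression fails the workflow instead of shipping.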
Known minor cosmetic issues (not blocking)
- `<VersionBadge>` renders as multi-segment text (e.g. `Added in<!-- --> : <!-- -->v4.2.0`). The semantic content is preserved; the spacing is just awkward. Can be improved later via custom `rehypePlugins` if it bothers downstream consumers.
- `[](#anchor "Direct link to ...")` anchor links leak into the markdown. Also cleanable later with a custom rehype plugin.
- The plugin reports one warning about `/reference` (the top-level reference index has no extractable content). Expected and benign.

These are content-quality refinements that can be iterated on in follow-up PRs; they don't block consumers from using the output.
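As an illustration of how small the badge cleanup could be, here is a sketch on a hast-shaped tree. It assumes the badge is a `<span>` whose class starts with `badge_` and whose children are text fragments separated by empty comment nodes; `cleanBadges` and the sample class name are hypothetical.

```javascript
// hast-shaped helpers (plain objects).
const el = (tagName, properties, children) => ({ type: 'element', tagName, properties, children });
const text = (value) => ({ type: 'text', value });
const comment = () => ({ type: 'comment', value: '' });

// Concatenate descendant text; comment nodes contribute nothing.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

// True for <span class="badge_...">. The class prefix is an assumption about the markup.
function isBadge(node) {
  const classes = (node.properties || {}).className || [];
  return node.type === 'element' && node.tagName === 'span' && classes.some((c) => c.startsWith('badge_'));
}

// Replace each badge span with <em> holding one tidy text node.
function cleanBadges(node) {
  if (!node.children) return node;
  node.children = node.children.map((child) => {
    if (!isBadge(child)) return cleanBadges(child);
    const label = textOf(child).replace(/\s+/g, ' ').replace(/\s+:/g, ':').trim();
    return el('em', {}, [text(label)]);
  });
  return node;
}

// Badge as React renders it: fragments split by empty comments.
const badge = el('span', { className: ['badge_x1y2'] }, [
  text('Added in'), comment(), text(' : '), comment(), text('v4.2.0'),
]);
const cleaned = cleanBadges(el('p', {}, [badge]));
```

Because the result is a single `<em>` with one text node, an HTML→Markdown converter has no comment markers left to preserve.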
What's not in this PR
- `@signalwire/docusaurus-theme-llms-txt` adoption (the "Copy Page" button). Independent UX feature, deferred.
- No `repository_dispatch` to the skills repo on deploy yet; that lands as part of Phase 2 (skills-side workflow).

🤖 Generated with Claude Code