feat: adopt @signalwire/docusaurus-plugin-llms-txt (Phase 1)#497

Merged
Ethan-Arrowood merged 2 commits into main from phase-1/llms-txt-plugin
May 26, 2026

@Ethan-Arrowood
Member

Phase 1 of the docs-driven skill generation plan

Adopts @signalwire/docusaurus-plugin-llms-txt so every rendered docs page also publishes a flat-markdown sibling at the same path (e.g. /reference/v5/rest/overview → /reference/v5/rest/overview.md), plus llms.txt and llms-full.txt index files at the build root.

The plan that drives this work lives in HarperFast/skills → Migration → Phase 1. This PR is the docs-repo side; the skills-repo side (Phase 2 — the generator + workflow that consumes this output) is separate.

Why this plugin

It operates on Docusaurus's postBuild route data, which surfaces routes from every registered docs plugin instance — learn, reference (with both v5 and v4), fabric, release-notes — uniformly. The HTML→markdown conversion via unified runs after the build, so MDX components, theme imports, custom React components (<VersionBadge>, <LatestPatchLink>, etc.), build-time data, and partial inclusions are all already resolved by the time the plugin sees them. No per-component handlers needed.

Community context: facebook/docusaurus#10899. This plugin (by SignalWire) is the de facto community choice, ~19k weekly downloads, MIT-licensed.

We previously tested the alternative docusaurus-plugin-llms (by rachfop) — it only picked up one of our four docs plugin instances, which is a hard incompatibility with how this site is configured.

What's in this PR

docusaurus.config.ts

Register the plugin with minimal config — defaults handle most things sensibly:

[
  '@signalwire/docusaurus-plugin-llms-txt',
  {
    content: {
      // Defaults: enableMarkdownFiles: true, includeDocs: true,
      // includeVersionedDocs: true. All four docs plugin instances
      // are picked up automatically; v4 reference docs are included
      // so the public artifacts cover the full site.
      enableLlmsFullTxt: true,
    },
  },
],

Notes:

  • No excludeRoutes: we emit flat markdown for the full site (including v4 reference docs) so the artifacts are useful to any consumer. The skills repo's manifest decides which routes to actually use.
  • Default contentSelectors work for the classic theme; no overrides needed.
  • enableLlmsFullTxt: true produces the bundled llms-full.txt alongside the per-page files. Useful for LLM tools that want one file.

.github/workflows/deploy.yaml

Adds a verification step after the build that fails the workflow if any of the four docs instances produced zero .md files, or if llms.txt / llms-full.txt are missing. Catches plugin regressions before they reach the deployed site.
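The PR does not show the workflow step itself, so the following is only a minimal sketch of the check it describes, written as a pure function over a list of build-relative file paths. The names `verifyLlmsOutput`, `DOCS_INSTANCES`, and `INDEX_FILES` are hypothetical and not taken from deploy.yaml.

```javascript
// Hypothetical sketch of the post-build check; the real deploy.yaml step is not shown in this PR.
const DOCS_INSTANCES = ['learn', 'reference', 'fabric', 'release-notes'];
const INDEX_FILES = ['llms.txt', 'llms-full.txt'];

/**
 * @param {string[]} buildFiles paths relative to the build root, e.g. "learn/intro.md"
 * @returns {string[]} human-readable failures; an empty array means the check passed
 */
function verifyLlmsOutput(buildFiles) {
  const failures = [];
  // Every docs instance must have produced at least one flat-markdown file.
  for (const instance of DOCS_INSTANCES) {
    const count = buildFiles.filter(
      (file) => file.startsWith(`${instance}/`) && file.endsWith('.md'),
    ).length;
    if (count === 0) failures.push(`no .md files under build/${instance}/`);
  }
  // Both index files must exist at the build root.
  for (const index of INDEX_FILES) {
    if (!buildFiles.includes(index)) failures.push(`missing build/${index}`);
  }
  return failures;
}

// Example: release-notes produced nothing and llms-full.txt is absent.
const sampleBuildFiles = [
  'learn/intro.md',
  'reference/v5/rest/overview.md',
  'fabric/index.md',
  'llms.txt',
];
console.log(verifyLlmsOutput(sampleBuildFiles));
```

A CI step built on this idea would exit non-zero whenever the returned array is non-empty, which is what lets a plugin regression fail the workflow before deploy.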

package.json

Adds @signalwire/docusaurus-plugin-llms-txt as a devDependency.

Verification

npm run build locally produces:

| Path | Count |
| --- | --- |
| build/learn/**/*.md | 10 |
| build/reference/**/*.md | 137 |
| build/fabric/**/*.md | 10 |
| build/release-notes/**/*.md | 211 |
| build/llms.txt | 385 lines |
| build/llms-full.txt | 36,913 lines |
| Total .md files | 371 |

Spot-checked content across all four docs instances — pages render correctly through the HTML→MD round-trip. Tables, code blocks, headings, internal links all preserved.

Known minor cosmetic issues (not blocking)

  • <VersionBadge> renders as multi-segment text (e.g. Added in<!-- --> : <!-- -->v4.2.0). The semantic content is preserved; the spacing is just awkward. Can be improved later via custom rehypePlugins if it bothers downstream consumers.
  • Docusaurus's auto-generated [​](#anchor "Direct link to ...") anchor links leak into the markdown. Also cleanable later with a custom rehype plugin.
  • One warning during build for /reference (the top-level reference index has no extractable content). Expected and benign.

These are content-quality refinements that can be iterated on in follow-up PRs — they don't block consumers from using the output.

What's not in this PR

  • No theme-side @signalwire/docusaurus-theme-llms-txt adoption (the "Copy Page" button). Independent UX feature, deferred.
  • No skills-repo work — that's Phase 2.
  • No repository_dispatch to the skills repo on deploy yet — that lands as part of Phase 2 (skills-side workflow).

🤖 Generated with Claude Code

Implements Phase 1 of the docs-driven skill generation plan (lives in
HarperFast/skills repo at docs/plans/docs-driven-skills.md). Adds a
postBuild step that converts every rendered HTML page to a flat-
markdown sibling (foo.html → foo.md) and emits llms.txt / llms-full.txt
index files at the build root.

Why this plugin specifically: it operates on Docusaurus's postBuild
route data, which captures routes from every registered docs plugin
instance (learn, reference, fabric, release-notes) uniformly. The
HTML→markdown conversion via `unified` means MDX components, theme
imports, custom React components, and build-time data are all
already resolved before we see them — no per-component handlers or
module shims required. See facebook/docusaurus#10899 for the broader
community context; this plugin (by SignalWire) is the de facto choice.

Changes:

- docusaurus.config.ts: register `@signalwire/docusaurus-plugin-llms-txt`
  with `enableLlmsFullTxt: true`. Defaults handle everything else —
  all four docs plugin instances are picked up automatically, and the
  default contentSelectors work for the classic theme. No excludeRoutes:
  we emit flat markdown for the full site (including v4 reference docs)
  so the artifacts are useful to any consumer; the skills repo's
  manifest decides which routes to actually use.

- .github/workflows/deploy.yaml: add a verification step after the
  build that fails the workflow if any of the four docs instances
  produced zero .md files, or if llms.txt / llms-full.txt are missing.
  Catches plugin regressions before they reach the deployed site.

- package.json: add @signalwire/docusaurus-plugin-llms-txt as a
  devDependency.

Verified locally: `npm run build` produces 371 .md files across all
four docs instances (learn: 10, reference: 137, fabric: 10,
release-notes: 211) plus llms.txt (385 lines) and llms-full.txt
(36913 lines). Spot-checked a v5 reference page, a learn MDX page
using imported components, a fabric page, and a release-notes page —
content renders correctly through the HTML→MD round-trip.

The plugin reports one warning about /reference (the empty top-level
reference index page that has no extractable content) — expected and
benign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood Ethan-Arrowood requested a review from a team as a code owner May 26, 2026 17:04
@socket-security

socket-security Bot commented May 26, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | @signalwire/docusaurus-plugin-llms-txt@1.2.2 | 87 | 100 | 100 | 88 | 100 |

View full report

@github-actions github-actions Bot temporarily deployed to pr-497 May 26, 2026 17:07 Inactive
@github-actions

🚀 Preview Deployment

Your preview deployment is ready!

🔗 Preview URL: https://preview.harper-documentation.harperfabric.com/pr-497

This preview will update automatically when you push new commits.

Member

@kriszyp kriszyp left a comment

Awesome. This is actually really nice.
Let's jump right to the hard stuff though, https://preview.harper-documentation.harperfabric.com/pr-497/learn/getting-started/create-your-first-application.md:
So the way it does tabs is like this: Local and Fabric are the tabs, it lists both tabs and then outputs their contents sequentially. I am not sure that is ideal, seems like the * Fabric header should go after the Local Installation section:

* Local Installation
* Fabric

Get started by cloning the [`HarperFast/create-your-first-application`](https://github.com/HarperFast/create-your-first-application) repo and opening it your editor of choice. If you have installed Harper using a container, make sure to clone into the `dev/` directory that the container was mounted to.

``
git clone https://github.com/HarperFast/create-your-first-application.git first-harper-app
``

From the "Cluster" page, navigate to the "Applications" tab and click on "New Application" on the left-hand sidebar.

Give the application a name such as "first-harper-app", then click on the "Import" tab.

Specify `https://github.com/HarperFast/create-your-first-application` in the "Git Repository URL" field.

Keep the "Install Command" empty and the "Authorization" as "Public Access".

Finally, click the "Import Application" button and wait for the application to be instantiated.

That being said, this is impressively clean. Much better than I expected. I think there are tweaks we should make, but I approve moving forward with this.

@Ethan-Arrowood
Member Author

Okay great, I'll see if we can iterate on the output now before moving to phase 2 of the skill update plan.

The default HTML→Markdown conversion in @signalwire/docusaurus-plugin-
llms-txt produces noisy or misleading output for several Docusaurus-
specific constructs. This commit adds a small rehype plugin
(scripts/rehype-docusaurus-to-llms.mjs) wired into the plugin's
`beforeDefaultRehypePlugins` chain to normalize them before the
HTML→MD conversion runs.

Three transforms:

1. Tabs (<div class="tabs-container">) → sequential `#### h4`
   subsections. Previously the default conversion stacked tab labels as
   a bullet list followed by all panel contents concatenated together,
   making it read as if every tab's content applied under every label.
   Now each tab is a properly labeled subsection containing its own
   content. 245 subsections produced across the site by this transform.

2. Hash-link anchors (<a class="hash-link">) → removed. These are
   Docusaurus's "direct link to this heading" UI affordances with no
   semantic value to an LLM. Default output included `[​](#anchor
   "Direct link to ...")` noise next to every heading; now gone.

3. Version badges (<span class="badge_*">Added in<!-- -->: ...</span>)
   → clean italic text. The React render leaves empty comment markers
   between text fragments which the default conversion preserved as
   literal `<!-- -->` strings. Now emits `*Added in: v4.2.0*`.

All three constructs are uniformly handled across all four docs
plugin instances (learn, reference, fabric, release-notes) because
the rehype pass runs at the HTML stage where every page's content
looks the same regardless of its source MDX.

Spot-checked output before/after on build/learn/getting-started/
create-your-first-application.md (heavy tabs usage) and
build/reference/v5/rest/overview.md (VersionBadge); both now read
cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood
Member Author

Update — addressed the cosmetic issues called out in the original PR description

Added a small custom rehype plugin (scripts/rehype-docusaurus-to-llms.mjs) wired into @signalwire/docusaurus-plugin-llms-txt via its beforeDefaultRehypePlugins hook. It runs three Docusaurus-specific cleanups before the HTML→MD conversion:

| Construct | Before | After |
| --- | --- | --- |
| Tabs (e.g. Local Installation / Fabric) | All labels stacked as a bullet list, then all panel contents concatenated under it | Each tab → labeled #### h4 subsection with its own content (245 subsections produced site-wide) |
| Heading anchor links (<a class="hash-link">) | [​](#anchor "Direct link to ...") noise next to every heading | Removed entirely |
| <VersionBadge> (<span class="badge_*">) | Added in<!-- --> : <!-- -->v4.2.0 | *Added in: v4.2.0* |

These were called out as "known minor cosmetic issues" in the original PR description. They're addressable from the HTML side because Docusaurus emits very consistent, identifiable markup for each construct.
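To illustrate why the HTML side is a convenient place for these cleanups, here is a rough sketch of the hash-link and version-badge transforms over a plain HAST-shaped tree. It is not the code in scripts/rehype-docusaurus-to-llms.mjs; the helper names (`classNames`, `textOf`, `cleanTree`) are hypothetical, and only the `hash-link` and `badge_*` class names come from this PR.

```javascript
// Illustrative sketch only; helper names are hypothetical, not from the actual plugin.

// HAST stores an element's classes as an array of strings under properties.className.
function classNames(node) {
  const cls = node.properties && node.properties.className;
  return Array.isArray(cls) ? cls : [];
}

// Concatenate all text beneath a node; comment nodes such as <!-- --> contribute nothing.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

// Recursively drop <a class="hash-link"> and replace badge spans with <em>text</em>.
function cleanTree(node) {
  if (!node.children) return node;
  node.children = node.children
    .filter(
      (child) =>
        !(child.type === 'element' &&
          child.tagName === 'a' &&
          classNames(child).includes('hash-link')),
    )
    .map((child) => {
      const isBadge =
        child.type === 'element' &&
        child.tagName === 'span' &&
        classNames(child).some((name) => name.startsWith('badge_'));
      if (!isBadge) return cleanTree(child);
      // Normalize spacing around the colon left behind by the React render.
      const label = textOf(child).replace(/\s*:\s*/, ': ').trim();
      return {
        type: 'element',
        tagName: 'em',
        properties: {},
        children: [{ type: 'text', value: label }],
      };
    });
  return node;
}

// Example tree: a heading with a hash-link, followed by a version badge.
const sampleTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'h3',
      properties: {},
      children: [
        { type: 'text', value: 'Overview' },
        {
          type: 'element',
          tagName: 'a',
          properties: { className: ['hash-link'], href: '#overview' },
          children: [{ type: 'text', value: '\u200b' }],
        },
      ],
    },
    {
      type: 'element',
      tagName: 'span',
      properties: { className: ['badge_abc123'] },
      children: [
        { type: 'text', value: 'Added in' },
        { type: 'comment', value: ' ' },
        { type: 'text', value: ' : ' },
        { type: 'comment', value: ' ' },
        { type: 'text', value: 'v4.2.0' },
      ],
    },
  ],
};
cleanTree(sampleTree);
console.log(JSON.stringify(sampleTree, null, 2));
```

Because the badge becomes an `<em>` element before conversion, the later HTML→MD step can emit it as italic text without any knowledge of the original React component.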

The plugin is small (~150 lines, well-commented) and lives at scripts/rehype-docusaurus-to-llms.mjs. New constructs can be added as additional transforms in the same file as they come up.

Verified end-to-end: same 371 .md files produced, but now with substantially better content quality. Spot-checked build/learn/getting-started/create-your-first-application.md (heavy tabs usage) and build/reference/v5/rest/overview.md (VersionBadge) — both now read cleanly.
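For readers curious about the tabs transform, a minimal sketch of the idea follows. It assumes the tab labels carry `role="tab"` and the panels carry `role="tabpanel"` inside the `tabs-container` div; that markup detail and the helper names (`findAll`, `hasRole`, `plainText`, `tabsToSections`) are assumptions for illustration, not taken from the plugin source.

```javascript
// Illustrative sketch; assumes role="tab" labels and role="tabpanel" panels inside the container.

// Collect matching nodes in document order, without descending into a match
// (so tabs nested inside a panel are left for a later pass).
function findAll(node, predicate, found = []) {
  if (predicate(node)) {
    found.push(node);
  } else {
    for (const child of node.children || []) findAll(child, predicate, found);
  }
  return found;
}

const hasRole = (role) => (node) =>
  node.type === 'element' && !!node.properties && node.properties.role === role;

// Concatenate the text content beneath a node.
function plainText(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(plainText).join('');
}

// Replace a tabs container with alternating <h4>label</h4> and panel content.
function tabsToSections(container) {
  const labels = findAll(container, hasRole('tab')).map((tab) => plainText(tab).trim());
  const panels = findAll(container, hasRole('tabpanel'));
  const children = [];
  panels.forEach((panel, index) => {
    children.push({
      type: 'element',
      tagName: 'h4',
      properties: {},
      children: [{ type: 'text', value: labels[index] || `Tab ${index + 1}` }],
    });
    children.push(...(panel.children || []));
  });
  return { type: 'element', tagName: 'div', properties: {}, children };
}

// Example: two tabs, each with one paragraph.
const paragraph = (value) => ({
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [{ type: 'text', value }],
});
const tab = (value) => ({
  type: 'element',
  tagName: 'li',
  properties: { role: 'tab' },
  children: [{ type: 'text', value }],
});
const panel = (value) => ({
  type: 'element',
  tagName: 'div',
  properties: { role: 'tabpanel' },
  children: [paragraph(value)],
});
const sampleTabs = {
  type: 'element',
  tagName: 'div',
  properties: { className: ['tabs-container'] },
  children: [
    { type: 'element', tagName: 'ul', properties: { role: 'tablist' }, children: [tab('Local Installation'), tab('Fabric')] },
    { type: 'element', tagName: 'div', properties: {}, children: [panel('Clone the repo.'), panel('Import the application.')] },
  ],
};
const sections = tabsToSections(sampleTabs);
console.log(sections.children.map((child) => child.tagName).join(', '));
```

Pairing each label with its own panel is what removes the ambiguity raised in review, where both labels were listed first and the panel contents followed as one undifferentiated run.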

@github-actions github-actions Bot temporarily deployed to pr-497 May 26, 2026 18:06 Inactive

@Ethan-Arrowood Ethan-Arrowood merged commit 8384bed into main May 26, 2026
7 checks passed
@Ethan-Arrowood Ethan-Arrowood deleted the phase-1/llms-txt-plugin branch May 26, 2026 18:42
@github-actions

🧹 Preview Cleanup

The preview deployment for this PR has been removed.

Ethan-Arrowood added a commit that referenced this pull request May 26, 2026
Two unrelated warnings have been firing during every docs build since
Phase 1 (#497) landed. Both are now fixed; an unrelated third warning
("Excluded N routes by current config") remains but is just accurate
accounting of intentional exclusions.

1. /reference route processing failure

   src/pages/reference/index.tsx is a client-side React redirect
   component that returns null. Its rendered HTML is an empty
   Docusaurus shell with no extractable content. The llms-txt plugin
   was attempting to process it and emitting a warning ("Route Error:
   Failed to process route '/reference': Failed to convert HTML to
   Markdown: HTML to Markdown conversion resulted in empty content").

   Fix: add '/reference' to the plugin's excludeRoutes config. This is
   not a real content page; it's a routing redirect that should never
   be considered for the flat-markdown export.

2. Broken anchor in resource-api.md

   reference/resources/resource-api.md line 799 linked to
   `../components/javascript-environment.md#transaction`, but the
   heading `### \`transaction(fn)\`` in that target file gets the
   auto-generated anchor `transactionfn` (parens stripped, dash dropped).
   The link wanted the semantic name `transaction`, not the auto-id.

   Fix: add an explicit anchor `{#transaction}` to the heading in
   javascript-environment.md. This gives the link a stable, human-
   meaningful target that survives heading-text changes and matches the
   author's original intent. Docusaurus 3.x supports the {#id} syntax
   natively.
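The anchor mismatch in item 2 can be reproduced with a simplified approximation of GitHub-style heading slugs. This is a sketch, not Docusaurus's actual slugger (which also handles Unicode and duplicate headings); `approximateSlug` and `headingAnchor` are hypothetical names.

```javascript
// Simplified approximation of heading slug generation; not the real Docusaurus slugger.
function approximateSlug(headingText) {
  return headingText
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9 _-]/g, '') // drop punctuation such as parentheses and backticks
    .replace(/ /g, '-'); // spaces become hyphens
}

// Hypothetical helper: an explicit {#id} at the end of the heading wins over the auto slug.
function headingAnchor(rawHeading) {
  const explicit = rawHeading.match(/\{#([^}]+)\}\s*$/);
  if (explicit) return explicit[1];
  return approximateSlug(rawHeading);
}

console.log(headingAnchor('`transaction(fn)`'));
console.log(headingAnchor('`transaction(fn)` {#transaction}'));
```

Under these assumptions the first call yields `transactionfn`, which is why a link to `#transaction` failed to resolve, and the second yields `transaction`, matching the link once the explicit id is added.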

Verified: rebuild produces the same 371 documents, both warnings are
gone, and grep confirms the rendered HTML now has id="transaction"
directly on the heading.

The remaining "Excluded 2 routes by current config" warning is the
plugin reporting that two routes were filtered: the home page `/`
(a content-pages route, excluded by default via includePages: false)
and `/reference` (our new explicit exclusion). Both are intentional;
the warning is accurate signal, not noise.
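As a rough sketch, the plugin entry after this fix might look like the following. Nesting `excludeRoutes` under `content` is an assumption that mirrors where the original PR placed `enableLlmsFullTxt`; the commit message only says it was added to the plugin's excludeRoutes config. The variable name `llmsTxtPluginEntry` is hypothetical.

```javascript
// Sketch of the plugin entry after the fix; the nesting of excludeRoutes is an assumption.
const llmsTxtPluginEntry = [
  '@signalwire/docusaurus-plugin-llms-txt',
  {
    content: {
      enableLlmsFullTxt: true,
      // /reference is a client-side redirect that renders no extractable content.
      excludeRoutes: ['/reference'],
    },
  },
];

console.log(llmsTxtPluginEntry[1].content.excludeRoutes);
```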

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>