docs: add plan for docs-driven skill generation #35

Open
Ethan-Arrowood wants to merge 1 commit into main from plan/docs-driven-skills

@Ethan-Arrowood
Member

Summary

Adds docs/plans/docs-driven-skills.md — the design for auto-generating Harper skill rules from the documentation repo so they stop drifting from the source of truth.

This PR is the plan document only. No implementation, no behavior change. Merging it commits the team to the approach; the work itself is broken into phases inside the doc.

Background

Today the 20 rule files under harper-best-practices/rules/ are maintained by hand. Agents sometimes assist, but a human still has to notice a docs change, prompt the rewrite, and open a PR. One example of drift: the rest: true config prerequisite was documented in reference/rest/overview.md long before the skill was patched to mention it.

What the plan covers

  • Concepts. Defines rule vs. skill vs. AGENTS.md vs. manifest, so the file purposes are explicit. Two generation modes: generate (auto-produced from docs) and synthesized (hand-authored, for content with no canonical docs source).
  • Guiding principle. Humans own the rule taxonomy; automation owns keeping rule bodies in sync with their declared sources.
  • User stories. Five workflows covering automated regen on docs prose changes, adding a new rule manually, authoring synthesized rules, fixing the manifest when docs structure changes, and adding a whole new skill.
  • Phased migration.
    • Phase 0 — manifest + lightweight validator, every existing rule mapped as synthesized (no behavior change today)
    • Phase 1 — one rule end-to-end (vector-indexing)
    • Phase 2 — expand to obvious .md-only rules
    • Phase 3 — flat-markdown export in HarperFast/documentation (Docusaurus plugin that flattens MDX components to plain markdown alongside the HTML build)
    • Phase 4 — MDX-sourced rules + observability
    • Phase 5 — steady state
  • Developer documentation. Substantially expands the existing .github/CONTRIBUTING.MD to explain repo anatomy, the generation pipeline, common contributor tasks, and what's automated vs. manual.
  • Validation layer. Spec for validate-generated.mjs (manifest completeness, provenance comments, must-cover assertions, MDX leakage, cross-link integrity, AGENTS.md round-trip).
  • Alternative: pointer strategy. Documented as a known fallback only — explicitly not in scope of this plan. If the generation approach disappoints, the team can revisit.
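
To make the validation layer more concrete, here is a minimal sketch of two of the checks listed above (provenance comments and MDX leakage). It is not the spec for validate-generated.mjs. The provenance comment format (`<!-- generated-from: <path>@<sha> -->`) and the function names `checkProvenance`, `checkMdxLeakage`, and `validateRule` are assumptions made for illustration, and the leakage check is a line-based heuristic rather than a real MDX parse.

```javascript
// Hypothetical provenance format, assumed for this sketch only:
//   <!-- generated-from: <docs path>@<commit sha> -->
const PROVENANCE = /^<!--\s*generated-from:\s*(\S+)@([0-9a-f]{7,40})\s*-->/;

// MDX-only syntax that should not survive into a generated rule body.
const MDX_IMPORT_EXPORT = /^(import\s.+from\s+['"]|export\s+(const|default|function)\b)/;
const JSX_COMPONENT_TAG = /<\/?[A-Z][A-Za-z0-9]*[\s/>]/;

// The first line of a generated rule must carry the provenance comment.
function checkProvenance(body) {
  const firstLine = body.split("\n", 1)[0];
  return PROVENANCE.test(firstLine)
    ? []
    : ["line 1: missing or malformed provenance comment"];
}

// Flag import/export statements and capitalized JSX tags outside code fences.
function checkMdxLeakage(body) {
  const problems = [];
  let inFence = false;
  body.split("\n").forEach((line, index) => {
    if (/^\s*(```|~~~)/.test(line)) {
      inFence = !inFence; // fenced examples may legitimately contain JSX
      return;
    }
    if (inFence) return;
    if (MDX_IMPORT_EXPORT.test(line) || JSX_COMPONENT_TAG.test(line)) {
      problems.push(`line ${index + 1}: possible MDX leakage: ${line.trim()}`);
    }
  });
  return problems;
}

// Returns a list of human-readable problems; an empty list means the body passed.
function validateRule(body) {
  return [...checkProvenance(body), ...checkMdxLeakage(body)];
}
```

Returning a list of problems instead of throwing on the first one lets a CI job report every failing rule in a single run. Inline code spans that mention a component name would still be flagged by this heuristic, so a real validator would likely want an AST-based check.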

Reviewer notes

  • The plan does not lock down implementation specifics that should be debated during Phase 0 (manifest YAML shape is illustrative, not final).
  • Phase 3 is the largest cross-repo commitment and the one most worth a careful read. It assumes the docs team (same person, in this case) is willing to ship a flat-markdown export. If that's contentious, surface it now.
  • Phase 3 can run in parallel with Phase 2 — Phase 2 candidates all source from .md files.
  • The plan file itself is a planning artifact; once Phase 5 lands it can be archived. .github/CONTRIBUTING.MD is the long-lived companion.

🤖 Generated with Claude Code

Captures the design for auto-generating Harper skill rules from the
documentation repo. Covers concepts (rule/skill/manifest/modes), user
stories for automatic and manual workflows, a phased migration starting
from today's hand-authored rules, the validation layer, and a
documented (but out-of-scope) pointer-strategy fallback.

Phase 3 commits to a flat-markdown export from HarperFast/documentation
as the source of truth for MDX content, rather than parsing MDX
statically from the skills side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood Ethan-Arrowood requested a review from a team as a code owner May 22, 2026 19:02
Member

@cb1kenobi left a comment

Fascinating!

Member

@kriszyp left a comment

This is fantastic, an excellent approach. Thank you for putting this together!
I left some comments to consider as you build this, but I certainly think you can move forward with it.

2. Skills `generate.yaml` runs: reads `rules.manifest.yaml`, fetches docs at that SHA, detects that the input hash for `querying-rest-apis` changed.
3. The generator calls Claude under the rule template, produces a new `rules/querying-rest-apis.md`, refreshes `AGENTS.md`, and updates the lock file.
4. Workflow opens a PR: `docs: regenerate rules from documentation@a1b2c3d`. The PR body lists which rules changed and links the upstream docs commit.
5. A maintainer reviews the diff — agent-facing prose still reads cleanly, the new edge case is mentioned. Merge.
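
The change-detection part of step 2 can be sketched as follows. This is an illustration under stated assumptions, not the plan's actual generator. The manifest and lock-file shapes (`rules[].name`, `rules[].mode`, `rules[].sources`, `lock.rules[name].inputHash`) and the helper names `inputHash` and `changedRules` are hypothetical; the PR notes that the manifest YAML shape is illustrative and not final.

```javascript
const { createHash } = require("node:crypto");

// Hash the declared inputs of one rule. Sources are sorted by path so the
// hash does not depend on the order they appear in the manifest.
function inputHash(sources) {
  const hash = createHash("sha256");
  const ordered = [...sources].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0
  );
  for (const { path, content } of ordered) {
    // NUL separators keep ("ab", "c") distinct from ("a", "bc").
    hash.update(path).update("\0").update(content).update("\0");
  }
  return hash.digest("hex");
}

// Return the names of generated rules whose inputs no longer match the lock.
// `readSource(path)` is an assumed callback returning the docs content at the
// pinned SHA; synthesized rules are skipped because they have no docs source.
function changedRules(manifest, lock, readSource) {
  return manifest.rules
    .filter((rule) => rule.mode === "generate")
    .filter((rule) => {
      const sources = rule.sources.map((path) => ({
        path,
        content: readSource(path),
      }));
      return inputHash(sources) !== lock.rules?.[rule.name]?.inputHash;
    })
    .map((rule) => rule.name);
}
```

Only the rules returned by `changedRules` would need to go through the model call in step 3, which keeps regeneration PRs limited to rules whose source docs actually changed.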
Member

This is the right approach for the beginning, but the goal would be to hopefully eliminate this, based on your guiding principle, right?

Anything that re-renders prose when docs change is automated

2. Runs `npm run generate` locally. The script produces `rules/streaming-uploads.md`, rebuilds `AGENTS.md`, updates the lock file.
3. Opens a PR titled `feat: add streaming-uploads rule`. PR includes the manifest change _and_ the generated body so reviewers can see what the agent will read.
4. After review and merge, semantic-release publishes a minor version (because `feat:`).
Member

And likewise once we gain confidence in updates, hopefully additions would be fully automated (no human) as well per the guiding principle.


Work in `HarperFast/documentation`:

- Add a Docusaurus plugin (or remark/rehype pass in the existing build pipeline) that, for each MDX page, walks the AST and emits a flat-markdown rendering at `build/flat/<source-path>.md`. Component handling:
Member

Just to clarify, is the analysis here that our current MDX files have too much noise from JSX components/tags, and that stripping them down to cleaner MD files fixes that? And that simply offering agent guidance for the renderer ("please ignore JSX components") is likely to be less efficient (for agents/LLMs) than reading docs with AST cleansing?
I don't know if this influences the technique, but I believe our source files are much closer to what we want agents to read than our generated HTML. A technique that can directly translate source to "flat" markdown without dealing with the HTML seems ideal.
I think this also solves the long-standing question of providing agent-optimized Markdown for public AI crawlers, hopefully in an efficient manner.


### Phase 4 — Awkward and MDX-sourced rules + observability

With flat-markdown available, take on the remaining rules — including those that source from `/learn` MDX content — and stand up the observability layer that catches automation failures.
Member

We are also going to consider regenerating existing skills content from the documentation source, right? I believe we should at least try that. Perhaps some existing skills (that would be considered "synthesized") might be deemed high quality, but in general we want to actually replace our existing skills content with the generated content. Otherwise those rules are stuck in a synthesized state, which hinders more automated regeneration.


These should be resolved before Phase 1 begins:

- Anthropic API key provisioning for the skills repo's Actions runner — who owns it.
Member

We have already gone down the path of acquiring an Anthropic API key for PR reviews, so hopefully that process can be followed for this. I believe generating the API key is easy; it is just a matter of making sure we have the secret set up.
It seems like it is also worth considering the use of Gemini, and maybe Claude can build an option to use either. Again, we have tons of credits, so if economics start to play into this, that could be helpful (although I suspect this should be relatively inexpensive).


This section documents a secondary strategy we may pivot to in the future. **It is not part of the implementation scope of this plan** — we are not building for it, designing flags around it, or constraining the generation work to accommodate it. It exists in this document so the team has a known fallback if the generation approach disappoints.

If after Phase 2 or 3 the team decides generation isn't pulling its weight — auto-PRs are too noisy, prompt tuning never converges, or reviewer fatigue sets in — we pivot to **pointer mode**: embed the docs source directly into the skills repo (git submodule, subtree, or sparse checkout of `HarperFast/documentation`), and have each rule become a thin pointer file (frontmatter + "when to use" + a link into the embedded docs).
Member

Is the hypothesis that generated skills/rules should be more succinct and conducive to LLMs remaining attentive to reading them, rather than LLMs starting to "skim" long embedded/linked documentation? Or is it partly the cleansing of JSX that benefits the skill? A "release-asset" (of cleansed flat markdown) as the source of skills could address that. Perhaps we might also want to consider more flexibility/hybrid-ness and offer "synthesized", "generated", and "flat" (or "direct") with the third option indicating that the source (flat) markdown file should be imported as-is without any LLM summarization.
I will say that I do believe the hypothesis that LLM summarization is likely to be better. But these might be good options to retain and compare.

4 participants