A website cloning tool that captures live pages with Playwright and reuses the original source where it can, instead of regenerating a page from a screenshot. It ships as a CLI, an MCP server, and a Claude Code / Codex skill.
Given a URL, it inspects the page, decides whether the page can be reused directly (iframe, embed, export) or has to be rebuilt from captured evidence, captures DOM / styles / assets / network traffic, generates a bounded HTML/CSS/React scaffold when a rebuild is needed, and compares the result against the original with visual, DOM, computed-style, interaction, and responsive-breakpoint checks.
It works best on static and semi-static pages — marketing sites, landing pages, docs, and frame-blocked pages that need capture-based reconstruction. See Limitations for what it does not do.
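The reuse-or-rebuild decision described above can be pictured as a short priority check. This is a minimal sketch, not the tool's actual implementation: `Route`, `PageHints`, and `choose_route` are hypothetical names, and the two boolean hints stand in for whatever the real inspection step reports.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DIRECT_REUSE = "direct-reuse"        # iframe / embed
    ORIGINAL_SOURCE = "original-source"  # export / preview / source
    BOUNDED_REBUILD = "bounded-rebuild"  # rebuild from captured evidence


@dataclass
class PageHints:
    # Hypothetical inspection results; the real tool's fields may differ.
    embeddable: bool = False           # page can be iframed or offers an embed
    has_original_export: bool = False  # an export, preview, or source is available


def choose_route(hints: PageHints) -> Route:
    """Pick the cheapest faithful route, falling back to a rebuild last."""
    if hints.embeddable:
        return Route.DIRECT_REUSE
    if hints.has_original_export:
        return Route.ORIGINAL_SOURCE
    return Route.BOUNDED_REBUILD
```

The ordering is the point of the sketch: a rebuild is only chosen when neither form of reuse is available.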
- Source-first routing: prefers direct iframe/embed reuse, then original export/preview/source, and only falls back to a bounded rebuild when reuse is unavailable.
- Live browser capture: DOM snapshot, runtime HTML, full-page screenshot, computed-style and CSS summaries, asset inventory, HAR-style network metadata, and interaction states.
- Bounded rebuild: reconstructs X-Frame-Options / CSP-blocked pages from captured evidence into reusable starter.html / starter.css / starter.tsx / Next.js scaffolds, preserving custom tags, shadow-root hosts, and semantic structure where captured.
- Self-verification: screenshot, DOM, computed-style, and interaction-state similarity plus desktop/tablet/mobile breakpoint reports. Scores are generated by the verification pipeline, not hand-assigned.
- Evidence reporting: separates directly captured artifacts from inferred or missing evidence, and marks auth-gated, app-gated, and native-app surfaces as bounded.
- HAR replay: replays request specs against standard HAR, near-HAR, or captured network/manifest.json artifacts.
- Job queue: a filesystem-backed async clone queue with durable JSON records, worker locks, retry scheduling, and cancellation.
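For context on what "X-Frame-Options / CSP-blocked" means in practice, the sketch below checks response headers for the two standard signals that forbid framing. It is an illustration only: `is_frame_blocked` is a hypothetical helper, not a function of this package, and it is deliberately conservative (any `frame-ancestors` list without a `*` wildcard is treated as blocked for a third-party embedder).

```python
from typing import Mapping


def is_frame_blocked(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers forbid third-party framing."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # X-Frame-Options: DENY and SAMEORIGIN both block a foreign embedder.
    xfo = lowered.get("x-frame-options", "").strip().lower()
    if xfo in ("deny", "sameorigin"):
        return True

    # Content-Security-Policy: look for a frame-ancestors directive.
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.strip().split()
        if parts and parts[0].lower() == "frame-ancestors":
            sources = [source.lower() for source in parts[1:]]
            # Simplification: only a bare wildcard counts as "anyone may embed".
            return "*" not in sources

    return False
```

A page for which this returns True cannot be reused through an iframe, which is the case the bounded rebuild is meant to cover.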
- Node.js 18+
- Python 3.9+
- Chrome or Chromium for Playwright capture (the package depends on playwright-core and does not download a browser itself)
From npm:
npm install -g web-embedding
web-embedding install
web-embedding doctor
Or without a global install:
npx web-embedding install
As a Claude Code plugin (from the GitHub mirror):
/plugin marketplace add keiailab/webEmbedding
/plugin install source-first-clone@webembedding
Installing adds the source-first-clone plugin bundle, the exact-clone-intake skill, and
the MCP server.
# Inspect a URL and print route hints
web-embedding inspect --url https://www.mozilla.org/
# Safe preflight: is this ready for reuse, capture, a session, manual review, or blocked?
web-embedding audit --url https://www.mozilla.org/
# Full clone workflow from a single URL
web-embedding clone \
--url https://developer.mozilla.org/en-US/ \
--output-dir ./.tmp/mdn-clone \
--breakpoints mobile tablet
# Compare two capture bundles
web-embedding verify \
--reference-bundle ./.tmp/reference/capture.json \
--candidate-bundle ./.tmp/candidate/capture.json
Other subcommands: capture, reproduce, scaffold, benchmark, queue, har-replay, capabilities, paths, telemetry, uninstall. Run web-embedding <command> --help for flags.
A clone run writes artifacts under the output directory, including capture.json, captured
dom/, styles/, assets/, network/ (with har.json and replay-report.json),
interactions/, screenshots/, and a reproduction/ tree with the rebuild scaffold and
self-verify/summary.json.
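A quick way to sanity-check a finished run is to confirm that the top-level entries listed above exist. The sketch below does only that; `missing_entries` is a hypothetical helper, and it assumes the listed names sit directly under the output directory. It does not look inside network/ or reproduction/, since the exact nested layout is not specified here.

```python
from pathlib import Path
from typing import List

# Top-level names taken from the artifact list above.
EXPECTED_ENTRIES = [
    "capture.json",
    "dom",
    "styles",
    "assets",
    "network",
    "interactions",
    "screenshots",
    "reproduction",
]


def missing_entries(output_dir: str) -> List[str]:
    """Return the expected top-level entries that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Example path from the clone command above.
    print(missing_entries("./.tmp/mdn-clone"))
```

An empty list means every expected entry is present; it says nothing about the fidelity of the clone itself.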
For MCP clients that launch stdio servers over npm:
{
"mcpServers": {
"source-first-clone": {
"command": "npx",
"args": ["-y", "web-embedding@latest", "mcp"]
}
}
}
The server exposes URL inspection, capture, rebuild, verification, queue, and HAR-replay
tools, including inspect_url, audit_reference_url, classify_clone_mode,
capture_reference_bundle, build_rebuild_scaffold, clone_reference_url,
verify_fidelity_report, and replay_har_requests.
A hosted, read-only intake endpoint is also available at
https://webembedding-mcp.vercel.app/mcp. It exposes only low-risk routing tools (URL
inspection, embed-candidate discovery, clone-mode classification, embed-snippet generation) —
it does not run Playwright, read local files, or persist artifacts. Full capture, rebuild, and
clone execution are local-only through the stdio package.
The exact-clone-intake skill triggers on requests like "clone this", "copy this page
exactly", or the Korean 그대로 가져와줘 / 완전 똑같이. It drives the MCP tools in the
inspect → capture → reuse-or-rebuild → verify order. It does not handle generic URL
summarization, scraping, or any request to bypass auth, paywalls, captcha, ownership, or
license boundaries.
Telemetry is disabled by default. An interactive web-embedding install asks once and
defaults to No; non-interactive installs (CI, curl | bash) never prompt. If enabled, it
sends an anonymous command-completion event (install id, version, command name,
success/failure, OS/runtime basics) to a JSON endpoint you control. It never sends target
URLs, local paths, captured HTML, screenshots, storage state, environment variables, or API
keys. See docs/telemetry.md.
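To make the scope of the event concrete, here is a rough sketch of a payload limited to the categories named above. The field names and the `build_event` helper are assumptions for illustration; the actual schema is whatever docs/telemetry.md defines, and the real runtime field may describe Node.js rather than Python.

```python
import platform
from typing import Dict


def build_event(install_id: str, version: str, command: str, success: bool) -> Dict[str, object]:
    """Build a command-completion event with no target or filesystem data."""
    return {
        "install_id": install_id,  # anonymous id, hypothetical field name
        "version": version,
        "command": command,        # command name only, never its arguments
        "success": success,
        "os": platform.system(),
        "runtime": f"python-{platform.python_version()}",
    }
```

Nothing in the payload is derived from the target URL, local paths, or captured content, which mirrors the exclusions listed above.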
This is not a full backend or app-logic clone engine. The hard boundary is server-side product behavior, not front-end evidence capture. The following are out of scope and need separate handling: login-only screens, app-first or native-app services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, and booking flows.
Reconstructing a page's public structure is not a license or ownership bypass. Authenticated
capture requires the caller to intentionally provide a storage_state_path or user_data_dir;
the tool does not collect credentials or perform login bypasses, and local URL entrypoints
reject non-HTTP schemes such as file://.
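The scheme restriction can be illustrated with a small validator. This is a sketch of the general technique using Python's standard library, not the package's own code; `require_http_url` is a hypothetical name.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def require_http_url(url: str) -> str:
    """Return url unchanged if it is an http(s) URL with a host, else raise."""
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {parts.scheme or '(none)'}")
    if not parts.netloc:
        raise ValueError("URL has no host")
    return url
```

With this check, `require_http_url("https://www.mozilla.org/")` passes through, while `file:///etc/passwd` raises `ValueError` before any capture is attempted.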
The canonical repository is on GitLab; the GitHub repository is a read-only push mirror. Please open Merge Requests on GitLab. See CONTRIBUTING.md.