webEmbedding

A website cloning tool that captures live pages with Playwright and reuses the original source where it can, instead of regenerating a page from a screenshot. It ships as a CLI, an MCP server, and a Claude Code / Codex skill.

Given a URL, it inspects the page and decides whether the page can be reused directly (iframe, embed, export) or has to be rebuilt from captured evidence. It then captures DOM, styles, assets, and network traffic, and generates a bounded HTML/CSS/React scaffold when a rebuild is needed. Finally, it compares the result against the original with visual, DOM, computed-style, interaction, and responsive-breakpoint checks.
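One input to the reuse-or-rebuild decision is whether the page allows itself to be framed. The following is a minimal Python sketch of such a check based on the standard X-Frame-Options and Content-Security-Policy frame-ancestors headers. It is not the tool's actual routing code: the function names and the route labels "iframe-reuse" and "bounded-rebuild" are hypothetical, and the check is deliberately conservative (any frame-ancestors restriction is treated as blocking).

```python
from typing import Mapping


def is_frame_blocked(headers: Mapping[str, str]) -> bool:
    """Return True when response headers restrict third-party iframe embedding."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # X-Frame-Options: DENY and SAMEORIGIN both block a cross-origin embedder.
    xfo = lowered.get("x-frame-options", "").strip().upper()
    if xfo in {"DENY", "SAMEORIGIN"}:
        return True

    # CSP frame-ancestors: only a bare "*" source allows every embedder.
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.split()
        if parts and parts[0].lower() == "frame-ancestors":
            sources = [part.lower() for part in parts[1:]]
            return "*" not in sources

    return False


def choose_route(headers: Mapping[str, str]) -> str:
    """Hypothetical route labels; the real tool may use different names."""
    return "bounded-rebuild" if is_frame_blocked(headers) else "iframe-reuse"
```

For example, choose_route({"X-Frame-Options": "DENY"}) yields the rebuild label, while a response with neither header yields the reuse label.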

It works best on static and semi-static pages — marketing sites, landing pages, docs, and frame-blocked pages that need capture-based reconstruction. See Limitations for what it does not do.

Features

Source-first routing — prefers direct iframe/embed reuse, then original export/preview/source, and only falls back to a bounded rebuild when reuse is unavailable.
Live browser capture — DOM snapshot, runtime HTML, full-page screenshot, computed-style and CSS summaries, asset inventory, HAR-style network metadata, and interaction states.
Bounded rebuild — reconstructs X-Frame-Options / CSP-blocked pages from captured evidence into reusable starter.html / starter.css / starter.tsx / Next.js scaffolds, preserving custom tags, shadow-root hosts, and semantic structure where captured.
Self-verification — screenshot, DOM, computed-style, and interaction-state similarity plus desktop/tablet/mobile breakpoint reports. Scores are generated by the verification pipeline, not hand-assigned.
Evidence reporting — separates directly captured artifacts from inferred or missing evidence, and marks auth-gated, app-gated, and native-app surfaces as bounded.
HAR replay — replays request specs against standard HAR, near-HAR, or captured network/manifest.json artifacts.
Job queue — a filesystem-backed async clone queue with durable JSON records, worker locks, retry scheduling, and cancellation.
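To illustrate what "filesystem-backed" with durable JSON records and worker locks can look like, here is a minimal Python sketch. It is an illustration of the general technique, not the package's implementation: the record fields, the .lock file naming, and the enqueue / try_claim helpers are all assumptions made for this example.

```python
import json
import os
import tempfile
from pathlib import Path


def enqueue(queue_dir: Path, job_id: str, url: str) -> Path:
    """Write one job as a JSON record (field names are hypothetical)."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    record = {"id": job_id, "url": url, "status": "queued", "attempts": 0}
    path = queue_dir / f"{job_id}.json"
    tmp = queue_dir / f"{job_id}.json.tmp"
    # Write to a temp file and rename, so readers never see a half-written record.
    tmp.write_text(json.dumps(record))
    os.replace(tmp, path)
    return path


def try_claim(queue_dir: Path, job_id: str) -> bool:
    """Claim a job by creating its lock file; only one worker can succeed."""
    lock = queue_dir / f"{job_id}.lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir)
    record_path = enqueue(demo_dir, "job1", "https://www.mozilla.org/")
    stored_status = json.loads(record_path.read_text())["status"]
    first_claim = try_claim(demo_dir, "job1")
    second_claim = try_claim(demo_dir, "job1")
```

In this sketch the first claim succeeds and the second is refused because the lock file already exists. Retry scheduling and cancellation would need extra fields on the record and are left out here.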

Requirements

Node.js 18+
Python 3.9+
Chrome or Chromium for Playwright capture (the package depends on playwright-core and does not download a browser itself)

Installation

From npm:

npm install -g web-embedding
web-embedding install
web-embedding doctor

Or without a global install:

npx web-embedding install

As a Claude Code plugin (from the GitHub mirror):

/plugin marketplace add keiailab/webEmbedding
/plugin install source-first-clone@webembedding

Installing adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server.

Usage

CLI

# Inspect a URL and print route hints
web-embedding inspect --url https://www.mozilla.org/

# Safe preflight: is this ready for reuse, capture, a session, manual review, or blocked?
web-embedding audit --url https://www.mozilla.org/

# Full clone workflow from a single URL
web-embedding clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --breakpoints mobile tablet

# Compare two capture bundles
web-embedding verify \
  --reference-bundle ./.tmp/reference/capture.json \
  --candidate-bundle ./.tmp/candidate/capture.json

Other subcommands: capture, reproduce, scaffold, benchmark, queue, har-replay, capabilities, paths, telemetry, uninstall. Run web-embedding <command> --help for flags.

A clone run writes artifacts under the output directory, including capture.json, captured dom/, styles/, assets/, network/ (with har.json and replay-report.json), interactions/, screenshots/, and a reproduction/ tree with the rebuild scaffold and self-verify/summary.json.
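As a quick sanity check after a run, the layout described above can be verified with a few lines of Python. This is a sketch under assumptions: it only checks the entries named in this section, missing_artifacts is a hypothetical helper rather than part of the package, and the exact nesting of self-verify/summary.json is not checked because it is not spelled out here.

```python
import tempfile
from pathlib import Path
from typing import List

# Entries named in this README; a real run may write more than these.
EXPECTED_ARTIFACTS = [
    "capture.json",
    "dom",
    "styles",
    "assets",
    "network/har.json",
    "network/replay-report.json",
    "interactions",
    "screenshots",
    "reproduction",
]


def missing_artifacts(output_dir: Path) -> List[str]:
    """Return the expected relative paths that do not exist under output_dir."""
    return [rel for rel in EXPECTED_ARTIFACTS if not (output_dir / rel).exists()]


# Demonstration against a throwaway directory with only two artifacts present.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "capture.json").write_text("{}")
    (root / "network").mkdir()
    (root / "network" / "har.json").write_text("{}")
    demo_missing = missing_artifacts(root)
```

Pointing missing_artifacts at a real output directory such as ./.tmp/mdn-clone from the clone example would list whatever that run did not produce.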

MCP server

For MCP clients that launch stdio servers over npm:

{
  "mcpServers": {
    "source-first-clone": {
      "command": "npx",
      "args": ["-y", "web-embedding@latest", "mcp"]
    }
  }
}

The server exposes URL inspection, capture, rebuild, verification, queue, and HAR-replay tools, including inspect_url, audit_reference_url, classify_clone_mode, capture_reference_bundle, build_rebuild_scaffold, clone_reference_url, verify_fidelity_report, and replay_har_requests.

A hosted, read-only intake endpoint is also available at https://webembedding-mcp.vercel.app/mcp. It exposes only low-risk routing tools (URL inspection, embed-candidate discovery, clone-mode classification, embed-snippet generation) — it does not run Playwright, read local files, or persist artifacts. Full capture, rebuild, and clone execution are local-only through the stdio package.

Skill

The exact-clone-intake skill triggers on requests like "clone this", "copy this page exactly", or the Korean 그대로 가져와줘 ("bring it over as is") / 완전 똑같이 ("exactly the same"). It drives the MCP tools in this order: inspect → capture → reuse-or-rebuild → verify. It does not handle generic URL summarization, scraping, or any request to bypass auth, paywalls, captcha, ownership, or license boundaries.

Telemetry

Telemetry is disabled by default. An interactive web-embedding install asks once and defaults to No; non-interactive installs (CI, curl | bash) never prompt. If enabled, it sends an anonymous command-completion event (install id, version, command name, success/failure, OS/runtime basics) to a JSON endpoint you control. It never sends target URLs, local paths, captured HTML, screenshots, storage state, environment variables, or API keys. See docs/telemetry.md.

Limitations

This is not a full backend or app-logic clone engine. The hard boundary is server-side product behavior, not front-end evidence capture. The following are out of scope and need separate handling: login-only screens, app-first or native-app services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, and booking flows.

Reconstructing a page's public structure is not a license or ownership bypass. Authenticated capture requires the caller to intentionally provide a storage_state_path or user_data_dir; the tool does not collect credentials or perform login bypasses, and local URL entrypoints reject non-HTTP schemes such as file://.
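The scheme restriction mentioned above can be sketched in Python with the standard library. This is only an illustration of the idea, assuming an allow-list of http and https; is_allowed_url and require_http_url are hypothetical names, and the package's own validation may differ.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these two schemes are accepted.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True for http(s) URLs that also carry a host component."""
    parts = urlsplit(url)
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)


def require_http_url(url: str) -> str:
    """Return the URL unchanged, or raise ValueError if it is not http(s)."""
    if not is_allowed_url(url):
        raise ValueError(f"unsupported URL (only http/https are accepted): {url!r}")
    return url
```

With this check, https://www.mozilla.org/ passes, while file:///etc/passwd is rejected before any capture is attempted.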

Contributing

The canonical repository is on GitLab; the GitHub repository is a read-only push mirror. Please open Merge Requests on GitLab. See CONTRIBUTING.md.

License

MIT
