webEmbedding

License: MIT | Python 3.9+ | Node 18+

A website cloning tool that captures live pages with Playwright and reuses the original source where it can, instead of regenerating a page from a screenshot. It ships as a CLI, an MCP server, and a Claude Code / Codex skill.

Given a URL, it:

  • inspects the page and decides whether it can be reused directly (iframe, embed, export) or has to be rebuilt from captured evidence;
  • captures DOM, styles, assets, and network traffic;
  • generates a bounded HTML/CSS/React scaffold when a rebuild is needed;
  • compares the result against the original with visual, DOM, computed-style, interaction, and responsive-breakpoint checks.

It works best on static and semi-static pages — marketing sites, landing pages, docs, and frame-blocked pages that need capture-based reconstruction. See Limitations for what it does not do.

Features

  • Source-first routing — prefers direct iframe/embed reuse, then original export/preview/source, and only falls back to a bounded rebuild when reuse is unavailable.
  • Live browser capture — DOM snapshot, runtime HTML, full-page screenshot, computed-style and CSS summaries, asset inventory, HAR-style network metadata, and interaction states.
  • Bounded rebuild — reconstructs X-Frame-Options / CSP-blocked pages from captured evidence into reusable starter.html / starter.css / starter.tsx / Next.js scaffolds, preserving custom tags, shadow-root hosts, and semantic structure where captured.
  • Self-verification — screenshot, DOM, computed-style, and interaction-state similarity plus desktop/tablet/mobile breakpoint reports. Scores are generated by the verification pipeline, not hand-assigned.
  • Evidence reporting — separates directly captured artifacts from inferred or missing evidence, and marks auth-gated, app-gated, and native-app surfaces as bounded.
  • HAR replay — replays request specs against standard HAR, near-HAR, or captured network/manifest.json artifacts.
  • Job queue — a filesystem-backed async clone queue with durable JSON records, worker locks, retry scheduling, and cancellation.
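
The source-first routing order described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function names, the route labels, and the `embed_url` / `export_url` parameters are hypothetical, and the real router likely weighs more evidence than response headers. It only shows the preference order (iframe/embed, then export, then bounded rebuild) and how X-Frame-Options or a CSP `frame-ancestors` directive can rule out direct framing.

```python
from typing import Mapping, Optional


def is_frame_blocked(headers: Mapping[str, str]) -> bool:
    """Return True when response headers restrict third-party framing."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}

    # X-Frame-Options: DENY and SAMEORIGIN both stop an unrelated embedder.
    if lowered.get("x-frame-options", "").strip() in ("deny", "sameorigin"):
        return True

    # CSP frame-ancestors: anything other than a bare wildcard is treated
    # as blocked for an arbitrary third-party page.
    for directive in lowered.get("content-security-policy", "").split(";"):
        parts = directive.split()
        if parts and parts[0] == "frame-ancestors":
            return "*" not in parts[1:]
    return False


def choose_route(
    headers: Mapping[str, str],
    embed_url: Optional[str] = None,   # hypothetical: a provider-offered embed
    export_url: Optional[str] = None,  # hypothetical: an export/preview/source link
) -> str:
    """Pick a route label (labels are illustrative, not the tool's own)."""
    if not is_frame_blocked(headers):
        return "iframe"
    if embed_url:
        return "embed"
    if export_url:
        return "export"
    return "bounded-rebuild"


if __name__ == "__main__":
    print(choose_route({"X-Frame-Options": "DENY"}))
```

Under these assumptions, a page that sends `X-Frame-Options: DENY` and offers neither an embed nor an export link falls through to the bounded rebuild, which matches the fallback order in the feature list.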

Requirements

  • Node.js 18+
  • Python 3.9+
  • Chrome or Chromium for Playwright capture (the package depends on playwright-core and does not download a browser itself)

Installation

From npm:

npm install -g web-embedding
web-embedding install
web-embedding doctor

Or without a global install:

npx web-embedding install

As a Claude Code plugin (from the GitHub mirror):

/plugin marketplace add keiailab/webEmbedding
/plugin install source-first-clone@webembedding

Installing adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server.

Usage

CLI

# Inspect a URL and print route hints
web-embedding inspect --url https://www.mozilla.org/

# Safe preflight: is this ready for reuse, capture, a session, manual review, or blocked?
web-embedding audit --url https://www.mozilla.org/

# Full clone workflow from a single URL
web-embedding clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --breakpoints mobile tablet

# Compare two capture bundles
web-embedding verify \
  --reference-bundle ./.tmp/reference/capture.json \
  --candidate-bundle ./.tmp/candidate/capture.json

Other subcommands: capture, reproduce, scaffold, benchmark, queue, har-replay, capabilities, paths, telemetry, uninstall. Run web-embedding <command> --help for flags.

A clone run writes artifacts under the output directory, including capture.json, captured dom/, styles/, assets/, network/ (with har.json and replay-report.json), interactions/, screenshots/, and a reproduction/ tree with the rebuild scaffold and self-verify/summary.json.
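
To check a finished run from a script, something like the following sketch may help. It assumes the artifact names listed above sit directly under the output directory; the `missing_artifacts` helper is hypothetical and not part of the package, and it deliberately does not assume any layout inside reproduction/.

```python
from pathlib import Path
from typing import List

# Paths taken from the artifact list above, relative to the output directory.
EXPECTED_ARTIFACTS = [
    "capture.json",
    "dom",
    "styles",
    "assets",
    "network/har.json",
    "network/replay-report.json",
    "interactions",
    "screenshots",
    "reproduction",
]


def missing_artifacts(output_dir: str) -> List[str]:
    """Return the expected artifact paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).exists()]


if __name__ == "__main__":
    # Directory name matches the clone example above.
    for rel in missing_artifacts("./.tmp/mdn-clone"):
        print(f"missing: {rel}")
```

A run that reuses the original source may legitimately omit some of these paths, so an entry in the returned list is a prompt to look closer rather than proof of failure.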

MCP server

For MCP clients that launch stdio servers over npm:

{
  "mcpServers": {
    "source-first-clone": {
      "command": "npx",
      "args": ["-y", "web-embedding@latest", "mcp"]
    }
  }
}
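
If the client already has other servers configured, the entry above has to be merged rather than pasted over the file. A minimal sketch of that merge follows; the `with_clone_server` helper and the `other-server` entry are hypothetical, and where the config file lives depends on the MCP client.

```python
import json
from typing import Any, Dict

# Server entry copied from the configuration shown above.
SERVER_NAME = "source-first-clone"
SERVER_ENTRY = {"command": "npx", "args": ["-y", "web-embedding@latest", "mcp"]}


def with_clone_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with the server entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[SERVER_NAME] = {
        "command": SERVER_ENTRY["command"],
        "args": list(SERVER_ENTRY["args"]),
    }
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # "other-server" stands in for whatever the client already has configured.
    existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
    print(json.dumps(with_clone_server(existing), indent=2))
```

The helper returns a new dictionary and leaves its input untouched, so the original config can be kept as a backup before the merged version is written out.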

The server exposes URL inspection, capture, rebuild, verification, queue, and HAR-replay tools, including inspect_url, audit_reference_url, classify_clone_mode, capture_reference_bundle, build_rebuild_scaffold, clone_reference_url, verify_fidelity_report, and replay_har_requests.

A hosted, read-only intake endpoint is also available at https://webembedding-mcp.vercel.app/mcp. It exposes only low-risk routing tools (URL inspection, embed-candidate discovery, clone-mode classification, embed-snippet generation) — it does not run Playwright, read local files, or persist artifacts. Full capture, rebuild, and clone execution are local-only through the stdio package.

Skill

The exact-clone-intake skill triggers on requests like "clone this", "copy this page exactly", or the Korean 그대로 가져와줘 ("bring it over as-is") / 완전 똑같이 ("completely identical"). It drives the MCP tools in this order: inspect → capture → reuse-or-rebuild → verify. It does not handle generic URL summarization, scraping, or any request to bypass auth, paywalls, captcha, ownership, or license boundaries.

Telemetry

Telemetry is disabled by default. An interactive web-embedding install asks once and defaults to No; non-interactive installs (CI, curl | bash) never prompt. If enabled, it sends an anonymous command-completion event (install id, version, command name, success/failure, OS/runtime basics) to a JSON endpoint you control. It never sends target URLs, local paths, captured HTML, screenshots, storage state, environment variables, or API keys. See docs/telemetry.md.

Limitations

This is not a full backend or app-logic clone engine. The hard boundary is server-side product behavior, not front-end evidence capture. The following are out of scope and need separate handling: login-only screens, app-first or native-app services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, and booking flows.

Reconstructing a page's public structure is not a license or ownership bypass. Authenticated capture requires the caller to intentionally provide a storage_state_path or user_data_dir; the tool does not collect credentials or perform login bypasses, and local URL entrypoints reject non-HTTP schemes such as file://.
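
The scheme restriction mentioned above can be pictured with a short sketch. This is an illustration, not the tool's own validation code: the `is_allowed_target` name and the exact rule (HTTP or HTTPS scheme plus a non-empty host) are assumptions.

```python
from urllib.parse import urlsplit

# Assumed allow-list: only web URLs are accepted as clone targets.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_target(url: str) -> bool:
    """Return True only for http(s) URLs that name a host."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError on malformed input such as a bad IPv6 literal.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(host)


if __name__ == "__main__":
    for candidate in ("https://www.mozilla.org/", "file:///etc/hosts"):
        print(candidate, is_allowed_target(candidate))
```

Checking the parsed scheme against an allow-list, rather than rejecting a few known-bad prefixes, also keeps out schemes such as `javascript:` or `data:` that a deny-list might miss.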

Contributing

The canonical repository is on GitLab; the GitHub repository is a read-only push mirror. Please open Merge Requests on GitLab. See CONTRIBUTING.md.

License

MIT
