From d4a0bfe8848921236778997e22559a83bf92e905 Mon Sep 17 00:00:00 2001
From: YurunChen <1657503372@qq.com>
Date: Fri, 15 May 2026 11:03:02 +0800
Subject: [PATCH] Improve clone-website skill fidelity guardrails.

Adds snapshot contract requirements, entity-level asset-text binding rules, interactive parity guidance, and hard acceptance gates so new mirrors converge to higher visual fidelity with fewer semantic alignment regressions.
---
 .claude/skills/clone-website/SKILL.md | 36 +++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/.claude/skills/clone-website/SKILL.md b/.claude/skills/clone-website/SKILL.md
index de4ba46..7e6cd44 100644
--- a/.claude/skills/clone-website/SKILL.md
+++ b/.claude/skills/clone-website/SKILL.md
@@ -57,6 +57,14 @@ in three places (must stay in sync):
 
 Modern target sites (Amazon, Booking, Apple, Coursera, ...) are JS-heavy SPAs / hydrated React apps. `requests.get(url)` returns an empty shell — no products, no images, no cards. Recon and scraping both **must** be done by driving a real Chromium via Playwright (or an equivalent real-browser tool); only that path produces the rendered DOM and the real image URLs the live site actually serves. The `agent_demo/` env already has Playwright + Chromium installed via `uv sync` — reuse it.
 
+Before scraping, define a **Snapshot Contract** for this mirror and save it under `sites//scraped_data/snapshot_contract.json`:
+
+- `captured_at`: one timestamp for the whole recon pass
+- `locale`, `timezone`, `viewport`: fixed browser settings used for all captures
+- `modules`: canonical upstream URL per mirrored module/page type (`home`, `list`, `detail`, `search`, `auth`, ...)
+
+Do not switch upstream URLs mid-implementation for the same module. If a URL changes, update the contract and recapture all affected evidence.
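A minimal sketch of how the Snapshot Contract described above could be written and checked, using only the standard library. The field names come from the list above; the helper names (`build_snapshot_contract`, `save_snapshot_contract`, `changed_modules`), the default locale/timezone/viewport values, the `{"width", "height"}` viewport shape, and the example URLs are illustrative assumptions, not part of the skill.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_snapshot_contract(modules, locale="en-US", tz="UTC", viewport=(1280, 800)):
    """Build the contract dict with one timestamp shared by the whole recon pass.

    The defaults and the viewport dict shape are assumptions for illustration.
    """
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "locale": locale,
        "timezone": tz,
        "viewport": {"width": viewport[0], "height": viewport[1]},
        "modules": dict(modules),  # module name -> canonical upstream URL
    }


def save_snapshot_contract(contract, scraped_data_dir):
    """Write snapshot_contract.json into the given scraped_data directory."""
    path = Path(scraped_data_dir) / "snapshot_contract.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(contract, indent=2, sort_keys=True), encoding="utf-8")
    return path


def changed_modules(old, new):
    """Return module names whose canonical URL differs between two contracts.

    Evidence for these modules would need to be recaptured.
    """
    names = set(old["modules"]) | set(new["modules"])
    return sorted(n for n in names if old["modules"].get(n) != new["modules"].get(n))
```

Comparing the saved contract against a proposed one with `changed_modules` before editing a module gives a concrete list of what to recapture when an upstream URL changes.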
+
 Minimum scraping recipe — render the page, then pull the post-hydration DOM and the resolved image `src` attributes:
 
 ```python
@@ -123,6 +131,14 @@ Drive the live site with Playwright (recipe above) and download assets into `scr
 
 **Critical**: use REAL images from the live site, captured via Playwright + a follow-up `httpx.get` of the resolved URL. Never use placeholders, colored rectangles, AI-generated stock photos, or `requests.get(target_url)` HTML (it returns a JS shell without the image URLs). Multimodal fidelity is a core WebHarbor differentiator and is the #1 reason agents reject reviews.
 
+Apply **Resource Alignment Rules** before seeding:
+
+- Build an entity-level mapping file (`scraped_data/entity_assets.jsonl`) with at least:
+  - `entity_key` (stable id/slug), `title`, `upstream_url`, `image_url`, `local_path`
+- Seed rows must join assets by `entity_key` (or another stable key), not by list index position.
+- If a real upstream image exists for an entity, do not replace it with fallback/default images.
+- During review, sample-check `title -> image -> upstream_url` triplets for semantic consistency.
+
 ### Step 4: Backend build
 
 Edit `sites//app.py`:
@@ -167,6 +183,12 @@ Create Jinja2 templates under `sites//templates/`:
 
 Match the original site's color scheme, typography, and navigation. Don't ship a generic Bootstrap theme.
 
+Enforce **Interactive Parity** for major controls:
+
+- Tabs/chips/filters/sort controls must change content state, not only active CSS class.
+- If a control is visible in UI, it should have deterministic backend behavior.
+- Remove benchmark-only explanatory copy from user-facing surfaces.
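The Interactive Parity rules above can be sketched as a pure function that a backend route could call: a filter or sort control has to change the returned rows, and the same inputs always yield the same order. The `Item` fields, the `SORTS` option names, and `apply_controls` are hypothetical names chosen for illustration; the skill does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    title: str
    category: str
    price: float


# Hypothetical sort options: name -> (key function, descending?)
SORTS = {
    "price_asc": (lambda i: i.price, False),
    "price_desc": (lambda i: i.price, True),
    "title": (lambda i: i.title.lower(), False),
}


def apply_controls(items, category=None, sort="title"):
    """Return the rows a page should render for the selected filter and sort.

    Unknown sort values raise instead of silently falling back, so a visible
    control never ends up without backend behavior.
    """
    if sort not in SORTS:
        raise ValueError(f"unsupported sort: {sort!r}")
    rows = [i for i in items if category is None or i.category == category]
    key, descending = SORTS[sort]
    # The title tie-break keeps the order deterministic when primary keys are equal.
    return sorted(rows, key=lambda i: (key(i), i.title), reverse=descending)
```

Keeping this logic out of the template means the active CSS class and the rendered rows are both derived from the same request parameters, which is easy to assert on in a test.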
+
 ### Step 6: Seed data
 
 Edit `sites//seed_data.py` so that `seed_database()` is **idempotent**:
@@ -215,10 +237,24 @@ docker exec wh-test md5sum \
   /opt/WebSyn//instance/.db \
   /opt/WebSyn//instance_seed/.db
 # both md5s MUST match
+
+# restart determinism gate (must still match after restart)
+docker restart wh-test && sleep 5
+docker exec wh-test md5sum \
+  /opt/WebSyn//instance/.db \
+  /opt/WebSyn//instance_seed/.db
+# both md5s MUST still match
 ```
 
 Then drive the mirror through Playwright (same recipe as recon, but pointing at `http://localhost:41000+i/`): screenshot the homepage, one listing page, one detail page, the login flow, and one search. Diff visually against the screenshots you captured in Step 2. If something looks like a coloured rectangle or "Image" alt text, you're missing real assets — go back to Step 3.
 
+Hard acceptance gates (all required):
+
+- **Broken image gate**: key pages should have zero broken image links (no 404 image URLs in rendered HTML/network logs).
+- **Semantic gate**: sampled cards/entities keep `title-image-link` aligned with upstream entity meaning.
+- **Visual gate**: screenshot compare for homepage + at least 2 core modules (list/detail or equivalent).
+- **Determinism gate**: md5 match after reset and after container restart (shown above).
+
 ## Output
 
 After Phase 1, you should have: