refactor: split site (Node) and automation (Python) responsibilities#63
kimchanhyung98 wants to merge 4 commits into develop
Conversation
Move scripts/markdown-link-utils.mjs and the translation-only validators into .github/docs-updater so the update-docs workflow can fail before committing translated output. structure_validator now runs at the end of main.py with the same JS semantics (anchor / heading / internal-link diff). Debugging CLIs (find_link_context, find_missing_links) move with their dependency. Co-Authored-By: Claude <noreply@anthropic.com>
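The "same JS semantics" check described above boils down to a set diff over each structural dimension (anchors, headings, internal links) between source and translation. A minimal sketch of that idea, not the actual `structure_validator` code:

```python
def structure_diff(
    source_items: list[str], translated_items: list[str]
) -> dict[str, list[str]]:
    """Diff one structural dimension (anchors, headings, or internal links)
    between a source document and its translation. Items only in the source
    are 'missing' from the translation; items only in the translation are 'extra'."""
    src, dst = set(source_items), set(translated_items)
    return {"missing": sorted(src - dst), "extra": sorted(dst - src)}
```

Running this once per dimension and failing on any non-empty result is enough to block a commit of structurally divergent translated output.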
build_redirect_generator generates the unversioned -> latest stable redirect HTML for each locale (both /docs/<slug>/index.html and /docs/<slug>.html shapes). build_anchor_validator checks that every markdown #fragment in versioned_docs/ resolves to an actual id in the built HTML, including the .html cleanUrls variant. deploy.yml now sets up uv alongside Node and runs both tools after npm run build, while update-docs.yml is Node-free and Python only. Also drops the matching scripts/*.mjs and the prebuild/postbuild hook in package.json. Co-Authored-By: Claude <noreply@anthropic.com>
Drop prebuild/postbuild hooks, sync:versions, and validate-anchors from package.json so site tooling no longer reaches into the automation domain. serve still ships its own static server (handles the dotted 13.x directory and .html cleanUrls fallback) and serve:docusaurus stays as the upstream baseline. playwright.config.ts now boots npm run start for the e2e webServer; tests that depend on prod-build behaviour either verify anchor mapping directly (docs-rendering) or wait for hydration (homepage), and the build-output existence assertions move out (build exit code is the source of truth). Co-Authored-By: Claude <noreply@anthropic.com>
workflow.md spells out the responsibility boundary between the Node site (hosting + landing page) and the Python automation (.github/docs-updater + workflows). README mirrors the deploy/update split so contributors hit the right pipeline. test_project_boundaries fails if any npm script ever calls into .github/ or python so the divide cannot regress silently. Co-Authored-By: Claude <noreply@anthropic.com>
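The boundary enforcement can be as simple as scanning `package.json` scripts for forbidden substrings. A hedged sketch of the idea (the real `test_project_boundaries` may be stricter):

```python
def check_npm_scripts(scripts: dict[str, str]) -> list[str]:
    """Return the names of npm scripts that cross the site/automation boundary
    by referencing .github/ or invoking python (illustrative substring check)."""
    return [
        name
        for name, cmd in scripts.items()
        if ".github/" in cmd or "python" in cmd
    ]
```

A pytest wrapper would load `package.json` and assert the returned list is empty, naming any offending script in the failure message.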
Code Review
This pull request refactors the documentation update and deployment workflow by migrating legacy Node.js scripts to a Python-based automation suite. The changes introduce robust tools for anchor validation, redirect generation, and structural consistency checks between source and translated documents. Review feedback highlights opportunities to improve Markdown parsing for nested brackets, refine anchor and heading extraction using regular expressions to prevent false positives, and broaden external URL detection to support additional protocols.
```python
def extract_markdown_links(text: str) -> list[MarkdownLink]:
    """Extract every `[label](url)` link. Ignores fenced/inline code."""
    stripped = strip_code(text)
    links: list[MarkdownLink] = []
    i = 0
    length = len(stripped)
    while i < length:
        label_start = stripped.find("[", i)
        if label_start < 0:
            break
        label_end = stripped.find("]", label_start + 1)
        if label_end < 0:
            break
        if label_end + 1 >= length or stripped[label_end + 1] != "(":
            i = label_end + 1
            continue
        url_end = stripped.find(")", label_end + 2)
        if url_end < 0:
            break
        url = _strip_title_suffix(stripped[label_end + 2 : url_end])
        if url:
            links.append(
                MarkdownLink(
                    text=stripped[label_start + 1 : label_end],
                    url=url,
                )
            )
        i = url_end + 1
    return links
```
The current loop-based parsing logic does not correctly handle links whose label contains square brackets, such as `[label with [nested] brackets](url)` (the label is treated as ending at the first `]`). A regular expression can extract links, including one level of nested brackets, more reliably.
```python
def extract_markdown_links(text: str) -> list[MarkdownLink]:
    """Extract every `[label](url)` link. Ignores fenced/inline code."""
    stripped = strip_code(text)
    links: list[MarkdownLink] = []
    # Regex that allows one level of nested square brackets in the label
    pattern = r'\[((?:[^\[\]]|\[[^\[\]]*\])*)\]\(([^)]+)\)'
    for match in re.finditer(pattern, stripped):
        label, raw_url = match.groups()
        url = _strip_title_suffix(raw_url)
        if url:
            links.append(MarkdownLink(text=label, url=url))
    return links
```

```python
def extract_anchors(text: str) -> list[str]:
    """Extract explicit `<a name="...">` anchors, excluding code regions."""
    anchors: list[str] = []
    stripped = strip_code(text)
    index = 0
    length = len(stripped)
    while index < length:
        tag_start = stripped.find("<a", index)
        if tag_start < 0:
            break
        tag_end = stripped.find(">", tag_start + 2)
        if tag_end < 0:
            break
        tag = stripped[tag_start : tag_end + 1]
        name_pos = tag.find("name=")
        if name_pos >= 0:
            quote_index = name_pos + len("name=")
            if quote_index < len(tag):
                quote = tag[quote_index]
                if quote in ('"', "'"):
                    value_start = quote_index + 1
                    value_end = tag.find(quote, value_start)
                    if value_end >= 0:
                        anchors.append(tag[value_start:value_end])
        index = tag_end + 1
    return anchors
```
In `extract_anchors`, `stripped.find("<a", index)` can also match elements that are not `<a>` tags, such as `<area>` or `<address>`. Likewise, searching for `name=` risks matching unintended attributes such as `data-name=`. It is safer to use a regular expression that extracts only the `name` attribute of actual `<a>` tags.
Suggested change:

```python
def extract_anchors(text: str) -> list[str]:
    """Extract explicit `<a name="...">` anchors, excluding code regions."""
    stripped = strip_code(text)
    # `<a\s` rules out <area>/<address>; (?<![-\w]) keeps data-name= from matching
    return re.findall(
        r'<a\s[^>]*?(?<![-\w])name=["\']([^"\']+)["\']', stripped, re.IGNORECASE
    )
```
```python
def extract_headings(text: str) -> list[Heading]:
    """Extract only ATX headings starting with `#`. Setext headings are not handled."""
    stripped = strip_code(text)
    headings: list[Heading] = []
    for line in stripped.split("\n"):
        level = 0
        while level < len(line) and line[level] == "#":
            level += 1
        if level < 1 or level > 6:
            continue
        if level >= len(line):
            continue
        if line[level] not in (" ", "\t"):
            continue
        headings.append(Heading(level=level, text=line[level + 1 :].strip()))
    return headings
```
`extract_headings` does not account for leading whitespace at the start of a line. The Markdown spec (CommonMark) allows up to three spaces before a heading, so this should be handled with `lstrip()`. The level check is also reinforced to prevent index-out-of-range errors.
Suggested change:

```python
def extract_headings(text: str) -> list[Heading]:
    """Extract only ATX headings starting with `#`. Setext headings are not handled."""
    stripped = strip_code(text)
    headings: list[Heading] = []
    for line in stripped.split("\n"):
        line = line.lstrip()
        level = 0
        while level < len(line) and line[level] == "#":
            level += 1
        if level < 1 or level > 6:
            continue
        if level >= len(line) or line[level] not in (" ", "\t"):
            continue
        headings.append(Heading(level=level, text=line[level + 1 :].strip()))
    return headings
```
```python
def _is_external_url(href: str) -> bool:
    lower = href.lower()
    return (
        lower.startswith("http://")
        or lower.startswith("https://")
        or lower.startswith("mailto:")
    )
```
```python
        html_path,
        html_path.read_text(encoding="utf-8"),
    )
if f'id="{anchor}"' in html:
```
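The cleanUrls variant mentioned in the commit description means each URL path can map to two files on disk. A hypothetical helper (names are illustrative) showing the lookup the validator has to perform before checking `id="…"` in the HTML:

```python
from pathlib import Path


def candidate_html_paths(build_root: Path, url_path: str) -> list[Path]:
    """Hypothetical helper: with cleanUrls, /docs/13.x/requests may be built
    as either .../requests/index.html or .../requests.html."""
    rel = url_path.strip("/")
    return [build_root / rel / "index.html", build_root / (rel + ".html")]
```

The validator would read whichever candidate exists and only then test for the anchor's `id`.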
🤖 Augment PR Summary

Summary: Refactors the repo so the Docusaurus site (Node) and the docs automation pipeline (Python) have a strict responsibility boundary. Key changes:

Technical notes: Introduces a boundary test to prevent npm scripts from invoking Python or reaching into `.github/`.
```python
if not url.startswith(prefix):
    return None
end = url.find("/", len(prefix))
return url[len(prefix) : end] if end >= 0 else None
```
`docs_version_from_url()` returns `None` for URLs like `/docs/13.x` (no trailing slash), which is exactly what `to_url_path()` produces for `installation.md`. That makes `src_version` `None` and skips `{{version}}` replacement / relative-link version prefixing, which can mis-resolve targets and cause incorrect anchor-validation failures.
Severity: high
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR refactors the docs site and automation tooling by moving translation structure checks and build-artifact validations from Node scripts into a dedicated Python pipeline under .github/docs-updater, and simplifying the site’s npm scripts/e2e setup accordingly.
Changes:
- Migrates translation structure validation, anchor validation, and redirect generation from `scripts/*.mjs` into Python modules with pytest coverage.
- Simplifies `package.json` scripts and updates workflows so `update-docs` becomes Python-only while `deploy` runs Node build + Python post-processing/validation.
- Adjusts Playwright to run against the dev server (`npm run start`) and updates e2e specs for the new responsibility split.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/validate-translation-structure.mjs | Removed (validation moved to Python). |
| scripts/validate-anchors.mjs | Removed (validation moved to Python). |
| scripts/sync-versioned-links.mjs | Removed (logic covered in Python main.py). |
| scripts/serve-build.mjs | Minor arg parsing rename for clarity. |
| scripts/markdown-link-utils.mjs | Removed (ported to Python utils). |
| scripts/find-missing-links.mjs | Removed (ported to Python CLI). |
| scripts/find-link-context.mjs | Removed (ported to Python CLI). |
| scripts/create-latest-doc-redirects.mjs | Removed (redirect generation moved to Python). |
| playwright.config.ts | Switches e2e webServer from static build to dev server. |
| package.json | Removes automation hooks; keeps Docusaurus/site-only scripts. |
| e2e/homepage.spec.ts | Adds wait to avoid hydration timing flakiness. |
| e2e/docs-rendering.spec.ts | Replaces hash-scroll assertion with rendered heading presence check. |
| e2e/build.spec.ts | Removed (build existence asserted elsewhere). |
| README.md | Updates workflow responsibility description to match refactor. |
| .github/workflows/update-docs.yml | Removes Node steps; runs Python pipeline only. |
| .github/workflows/deploy.yml | Adds uv/Python steps; runs Python redirect + anchor validation post-build. |
| .github/docs-updater/tests/test_structure_validator.py | New pytest coverage for structure validation parity. |
| .github/docs-updater/tests/test_project_boundaries.py | New test enforcing no Python/.github calls from npm scripts. |
| .github/docs-updater/tests/test_markdown_link_utils.py | New pytest coverage for markdown parsing utilities. |
| .github/docs-updater/tests/test_main.py | Adds regression tests for latest-stable API link handling. |
| .github/docs-updater/tests/test_build_redirect_generator.py | New pytest coverage for redirect generation output. |
| .github/docs-updater/tests/test_build_anchor_validator.py | New pytest coverage for built-anchor validation. |
| .github/docs-updater/structure_validator.py | New Python implementation of translation structure validation + reporting. |
| .github/docs-updater/markdown_link_utils.py | New Python markdown parsing utilities (ported from Node). |
| .github/docs-updater/main.py | Integrates structure validation + latest-stable sidebar behavior. |
| .github/docs-updater/find_missing_links.py | New Python CLI for missing/extra link debugging. |
| .github/docs-updater/find_link_context.py | New Python CLI for finding missing link context. |
| .github/docs-updater/build_redirect_generator.py | New Python build redirect generator for latest stable docs. |
| .github/docs-updater/build_anchor_validator.py | New Python validator for markdown anchors vs built HTML ids. |
| .github/docs-updater/.ai-context/workflow.md | Expanded documentation describing the new responsibility split. |
```python
def to_url_path(docs_root: Path, md_path: Path) -> str:
    parts = md_path.relative_to(docs_root).parts
    version = parts[0].removeprefix("version-")
    tail = "/".join(parts[1:])[:-3]
    if tail == "installation":
        return f"/docs/{version}"
    return f"/docs/{version}/{tail}"
```
```python
if not url.startswith(prefix):
    return None
end = url.find("/", len(prefix))
return url[len(prefix) : end] if end >= 0 else None
```
```python
def test_to_url_path_treats_installation_as_version_root(tmp_path: Path):
    docs_root = tmp_path / "versioned_docs"
    md = docs_root / "version-13.x" / "installation.md"
    md.parent.mkdir(parents=True)
    md.write_text("# Installation\n", encoding="utf-8")
    assert bav.to_url_path(docs_root, md) == "/docs/13.x"
```
```python
if part.startswith("version-"):
    return part[len("version-") :]
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4dde103da
```python
if not url.startswith(prefix):
    return None
end = url.find("/", len(prefix))
return url[len(prefix) : end] if end >= 0 else None
```
Handle version root URLs in docs version parser
`docs_version_from_url` returns `None` for URLs like `/docs/13.x` because it only succeeds when another `/` exists after the version segment. In this commit `to_url_path` now emits exactly `/docs/<version>` for `installation.md`, so any relative anchor link from that page (for example `requests#...`) is resolved without a version prefix and is reported as missing HTML even when `build/docs/<version>/requests(.html|/index.html)` exists. This creates false failures in deploy-time anchor validation for installation-page links.
Summary

Splits the translation-automation and site-build tooling, previously mixed between `scripts/*.mjs` and `package.json` hooks, into two responsibility domains.

- `package.json` removes the prebuild/postbuild, `validate-anchors`, and `sync:versions` scripts.
- Automation (`.github/docs-updater`, Python): owns translation, sidebars, translation structure validation, built-artifact anchor validation, and redirect HTML generation for unversioned paths.
- `update-docs.yml` is Python-only; `deploy.yml` runs Node typecheck/build, then Python redirect generation / anchor validation → Pages upload / deploy.

Key changes

- `scripts/markdown-link-utils.mjs`, `validate-translation-structure.mjs`, `find-link-context.mjs`, `find-missing-links.mjs` → ported to Python. The final step of `main.py` calls `structure_validator`.
- `scripts/create-latest-doc-redirects.mjs`, `validate-anchors.mjs`, `sync-versioned-links.mjs` → moved to `build_redirect_generator.py` / `build_anchor_validator.py`. Master-sidebar API-link normalization is absorbed into the `latest_stable` handling of `parse_documentation_md`.
- `scripts/serve-build.mjs` handles the dotted `13.x` directory, trailing slashes, and the `.html` cleanUrls fallback.
- `playwright.config.ts`: the e2e webServer now runs on top of `npm run start` (dev server).
- `e2e/build.spec.ts` removed (the existence of `build/` is guaranteed by the build command itself). Hash-scroll verification became anchor-mapping verification due to dev-server limitations.
- `.github/docs-updater/.ai-context/workflow.md` spells out the responsibility-split model, the mermaid flow, and local validation commands.
- `tests/test_project_boundaries.py` fails if any npm script contains a `.github/` or `python` call.

Verification results

- `cd .github/docs-updater && uv run pytest -q` → 85 passed
- `npm run typecheck -- --pretty false` → 0 errors
- `npm run build` → 0 warnings, 0 errors
- `cd .github/docs-updater && uv run python build_redirect_generator.py` → 101 redirects
- `cd .github/docs-updater && uv run python build_anchor_validator.py` → 23250/23250 OK
- `npm run test:e2e` → 70 passed
- `/docs/13.x/`, `/docs/13.x/upgrade#upgrade-13.0`, `/docs/pulse/`, `/en/docs/pulse`: zero console errors on all; anchors and redirects resolve correctly.

Test plan

- `cd .github/docs-updater && uv run pytest -q`
- `npm run typecheck -- --pretty false`
- `npm run build`
- `cd .github/docs-updater && uv run python build_redirect_generator.py`
- `cd .github/docs-updater && uv run python build_anchor_validator.py`
- `npm run test:e2e`
cd .github/docs-updater && uv run pytest -qnpm run typecheck -- --pretty falsenpm run buildcd .github/docs-updater && uv run python build_redirect_generator.pycd .github/docs-updater && uv run python build_anchor_validator.pynpm run test:e2e