refactor: split site (Node) and automation (Python) responsibilities#63

Draft
kimchanhyung98 wants to merge 4 commits into develop from chore/docs-updater

Conversation

@kimchanhyung98
Member

@kimchanhyung98 kimchanhyung98 commented May 2, 2026

Summary

Splits the translation automation and site build tooling, previously mixed together through scripts/*.mjs and package.json hooks, into two areas of responsibility.

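  • Site (Docusaurus): responsible only for the main page (home) and the static build. Removes prebuild/postbuild/validate-anchors/sync:versions from package.json.
  • Automation (.github/docs-updater, Python): handles translation, sidebars, translation structure validation, anchor validation of the build output, and generation of redirect HTML for unversioned paths.
  • Workflows: update-docs.yml is Python only; deploy.yml runs Node typecheck/build, then Python redirect generation / anchor validation → Pages upload / deploy.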

Key changes

  • scripts/markdown-link-utils.mjs, validate-translation-structure.mjs, find-link-context.mjs, find-missing-links.mjs → ported to Python. The last step of main.py calls structure_validator.
  • scripts/create-latest-doc-redirects.mjs, validate-anchors.mjs, sync-versioned-links.mjs → ported to build_redirect_generator.py / build_anchor_validator.py. Normalization of API links in the master sidebar is absorbed into the latest_stable handling in parse_documentation_md.
  • scripts/serve-build.mjs: handles the dotted 13.x directory, trailing slashes, and the .html cleanUrls fallback.
  • playwright.config.ts: webServer now runs on top of npm run start (the dev server). e2e/build.spec.ts is removed (the build command itself guarantees that build/ exists). Hash scroll verification is replaced with anchor mapping verification because of dev server limitations.
  • .github/docs-updater/.ai-context/workflow.md documents the responsibility-split model, the mermaid flow, and the local verification commands. tests/test_project_boundaries.py fails if any npm script contains a .github/ or python call.
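
The boundary check described in the last bullet could look roughly like the sketch below. It is not the actual contents of tests/test_project_boundaries.py; the function name `find_boundary_violations` and the token list are assumptions made for illustration.

```python
import json

# Assumed markers of a call into the automation domain (hypothetical list).
FORBIDDEN_TOKENS = (".github/", "python")


def find_boundary_violations(package_json_text: str) -> list[str]:
    """Return the names of npm scripts whose command mentions a forbidden token."""
    scripts = json.loads(package_json_text).get("scripts", {})
    return sorted(
        name
        for name, command in scripts.items()
        if any(token in command for token in FORBIDDEN_TOKENS)
    )


# A package.json that still reaches into the automation domain from a hook.
sample = json.dumps(
    {
        "scripts": {
            "build": "docusaurus build",
            "postbuild": "uv run python .github/docs-updater/build_anchor_validator.py",
        }
    }
)
violations = find_boundary_violations(sample)
```

A pytest case would then assert that the list is empty for the real package.json, so a reintroduced hook fails the updater test suite rather than going unnoticed.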

Verification results

  • cd .github/docs-updater && uv run pytest -q → 85 passed
  • npm run typecheck -- --pretty false → 0 errors
  • npm run build → 0 warning, 0 error
  • cd .github/docs-updater && uv run python build_redirect_generator.py → 101 redirects
  • cd .github/docs-updater && uv run python build_anchor_validator.py → 23250/23250 OK
  • npm run test:e2e → 70 passed
  • Browser smoke test (prod build): home, /docs/13.x/, /docs/13.x/upgrade#upgrade-13.0, /docs/pulse, and /en/docs/pulse all show 0 console entries; anchors and redirects work correctly

Test plan

  • cd .github/docs-updater && uv run pytest -q
  • npm run typecheck -- --pretty false
  • npm run build
  • cd .github/docs-updater && uv run python build_redirect_generator.py
  • cd .github/docs-updater && uv run python build_anchor_validator.py
  • npm run test:e2e
  • Browser spot check (home / docs root / upgrade anchor / unversioned redirect)

kimchanhyung98 and others added 4 commits May 2, 2026 15:56
Move scripts/markdown-link-utils.mjs and the translation-only validators
into .github/docs-updater so the update-docs workflow can fail before
committing translated output. structure_validator now runs at the end of
main.py with the same JS semantics (anchor / heading / internal-link
diff). Debugging CLIs (find_link_context, find_missing_links) move with
their dependency.

Co-Authored-By: Claude <noreply@anthropic.com>
build_redirect_generator generates the unversioned -> latest stable
redirect HTML for each locale (both /docs/<slug>/index.html and
/docs/<slug>.html shapes). build_anchor_validator checks that every
markdown #fragment in versioned_docs/ resolves to an actual id in the
built HTML, including the .html cleanUrls variant. deploy.yml now sets
up uv alongside Node and runs both tools after npm run build, while
update-docs.yml is Node-free and Python only. Also drops the matching
scripts/*.mjs and the prebuild/postbuild hook in package.json.

Co-Authored-By: Claude <noreply@anthropic.com>
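
The two redirect shapes named in the commit above (/docs/<slug>/index.html and /docs/<slug>.html) can be sketched as follows. This is an illustration under assumptions, not the code in build_redirect_generator.py: the helper names, the meta-refresh markup, and the omission of per-locale handling are all choices made here.

```python
import tempfile
from html import escape
from pathlib import Path


def redirect_html(target: str) -> str:
    """Build a minimal meta-refresh page pointing at `target` (assumed markup)."""
    safe = escape(target, quote=True)
    return (
        "<!DOCTYPE html>\n"
        f'<meta http-equiv="refresh" content="0; url={safe}">\n'
        f'<link rel="canonical" href="{safe}">\n'
    )


def write_redirects(build_dir: Path, slug: str, version: str) -> list[Path]:
    """Write both output shapes for one unversioned slug and return their paths."""
    html = redirect_html(f"/docs/{version}/{slug}")
    outputs = [
        build_dir / "docs" / slug / "index.html",  # directory-style URL
        build_dir / "docs" / f"{slug}.html",       # .html cleanUrls variant
    ]
    for path in outputs:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
    return outputs


with tempfile.TemporaryDirectory() as tmp:
    written = write_redirects(Path(tmp), slug="pulse", version="13.x")
    relative = [p.relative_to(tmp).as_posix() for p in written]
    contents = [p.read_text(encoding="utf-8") for p in written]
```

Writing both shapes means the redirect resolves whether the host serves /docs/pulse/ from a directory index or /docs/pulse from a .html file.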
Drop prebuild/postbuild hooks, sync:versions, and validate-anchors from
package.json so site tooling no longer reaches into the automation
domain. serve still ships its own static server (handles the dotted
13.x directory and .html cleanUrls fallback) and serve:docusaurus stays
as the upstream baseline. playwright.config.ts now boots npm run start
for the e2e webServer; tests that depend on prod-build behaviour either
verify anchor mapping directly (docs-rendering) or wait for hydration
(homepage), and the build-output existence assertions move out (build
exit code is the source of truth).

Co-Authored-By: Claude <noreply@anthropic.com>
workflow.md spells out the responsibility boundary between the Node
site (hosting + landing page) and the Python automation
(.github/docs-updater + workflows). README mirrors the deploy/update
split so contributors hit the right pipeline. test_project_boundaries
fails if any npm script ever calls into .github/ or python so the
divide cannot regress silently.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 2, 2026 07:12
@sonarqubecloud

sonarqubecloud Bot commented May 2, 2026

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the documentation update and deployment workflow by migrating legacy Node.js scripts to a Python-based automation suite. The changes introduce robust tools for anchor validation, redirect generation, and structural consistency checks between source and translated documents. Review feedback highlights opportunities to improve Markdown parsing for nested brackets, refine anchor and heading extraction using regular expressions to prevent false positives, and broaden external URL detection to support additional protocols.

Comment on lines +91 to +123
def extract_markdown_links(text: str) -> list[MarkdownLink]:
    """Extract every `[label](url)` link. Ignores fenced/inline code."""
    stripped = strip_code(text)
    links: list[MarkdownLink] = []
    i = 0
    length = len(stripped)
    while i < length:
        label_start = stripped.find("[", i)
        if label_start < 0:
            break

        label_end = stripped.find("]", label_start + 1)
        if label_end < 0:
            break

        if label_end + 1 >= length or stripped[label_end + 1] != "(":
            i = label_end + 1
            continue

        url_end = stripped.find(")", label_end + 2)
        if url_end < 0:
            break

        url = _strip_title_suffix(stripped[label_end + 2 : url_end])
        if url:
            links.append(
                MarkdownLink(
                    text=stripped[label_start + 1 : label_end],
                    url=url,
                )
            )
        i = url_end + 1
    return links

medium

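The current loop-based parsing logic does not correctly handle links whose label contains square brackets, such as [label with [nested] brackets](url) (it treats the label as ending at the first ]). A regular expression can extract links with nested brackets (one level deep) more reliably.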

def extract_markdown_links(text: str) -> list[MarkdownLink]:
    """Extract every `[label](url)` link. Ignores fenced/inline code."""
    stripped = strip_code(text)
    links: list[MarkdownLink] = []
    # Regular expression that allows one level of nested brackets
    pattern = r'\[((?:[^\[\]]|\[[^\[\]]*\])*)\]\(([^)]+)\)'
    for match in re.finditer(pattern, stripped):
        label, raw_url = match.groups()
        url = _strip_title_suffix(raw_url)
        if url:
            links.append(MarkdownLink(text=label, url=url))
    return links
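
To make the difference concrete, the sketch below runs the suggested pattern against the nested-label example from the comment and compares it with a first-`]` scan. The helper names `find_links` and `first_bracket_label` are hypothetical, and `strip_code` / `_strip_title_suffix` are left out so the block stands alone.

```python
import re

# Same pattern as in the suggestion above: one level of nested brackets in the label.
LINK_PATTERN = re.compile(r"\[((?:[^\[\]]|\[[^\[\]]*\])*)\]\(([^)]+)\)")


def find_links(text: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs matched by the regex."""
    return [(m.group(1), m.group(2)) for m in LINK_PATTERN.finditer(text)]


def first_bracket_label(text: str) -> str:
    """Mimic the loop-based scan: the label ends at the first `]`."""
    start = text.find("[")
    end = text.find("]", start + 1)
    return text[start + 1 : end]


sample = "See [label with [nested] brackets](/docs/13.x/routing) for details."
regex_links = find_links(sample)
loop_label = first_bracket_label(sample)

# Limitation of the pattern: a URL containing ")" is cut at the first ")".
truncated = find_links("[a](x_(y))")
```

The regex recovers the full label, while the first-`]` scan stops inside it. The last line shows a trade-off worth keeping in mind: because the URL part is `[^)]+`, a URL with parentheses is truncated, which the loop-based version also does.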

Comment on lines +51 to +78
def extract_anchors(text: str) -> list[str]:
    """Extract explicit `<a name="...">` anchors, excluding code regions."""
    anchors: list[str] = []
    stripped = strip_code(text)
    index = 0
    length = len(stripped)
    while index < length:
        tag_start = stripped.find("<a", index)
        if tag_start < 0:
            break

        tag_end = stripped.find(">", tag_start + 2)
        if tag_end < 0:
            break

        tag = stripped[tag_start : tag_end + 1]
        name_pos = tag.find("name=")
        if name_pos >= 0:
            quote_index = name_pos + len("name=")
            if quote_index < len(tag):
                quote = tag[quote_index]
                if quote in ('"', "'"):
                    value_start = quote_index + 1
                    value_end = tag.find(quote, value_start)
                    if value_end >= 0:
                        anchors.append(tag[value_start:value_end])
        index = tag_end + 1
    return anchors

medium

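Using stripped.find("<a", index) in extract_anchors can also match elements that are not <a> tags, such as <area> or <address>. In addition, searching for name= risks picking up unintended attributes such as data-name=. It is safer to use a regular expression that extracts only the name attribute of actual <a> tags.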

Suggested change
def extract_anchors(text: str) -> list[str]:
    """Extract explicit <a name=\"...\"> anchors, excluding code regions."""
    import re
    stripped = strip_code(text)
    return re.findall(r'<a\s+[^>]*name=["\']([^"\']+)["\']', stripped, re.IGNORECASE)
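
One caveat with the suggested pattern: `[^>]*` can consume the `data-` prefix, so `<a data-name="x">` still matches. A stricter variant, offered only as a sketch, requires whitespace directly before `name=`. The names `ANCHOR_NAME` and `extract_anchor_names` are hypothetical, and `strip_code` is omitted.

```python
import re

# Pattern from the suggestion above, kept for comparison.
SUGGESTED = re.compile(r'<a\s+[^>]*name=["\']([^"\']+)["\']', re.IGNORECASE)

# Stricter sketch: `<a` must be followed by whitespace, and `name=` must be
# preceded by whitespace, so `<area ...>` and `data-name=` are rejected.
ANCHOR_NAME = re.compile(
    r"""<a\s(?:[^>]*?\s)?name=(["'])([^"'>]+)\1""",
    re.IGNORECASE,
)


def extract_anchor_names(text: str) -> list[str]:
    """Return the `name` values of `<a>` tags only."""
    return [m.group(2) for m in ANCHOR_NAME.finditer(text)]


sample = (
    '<a name="intro"></a> <area name="map"> '
    '<a data-name="x"></a> <a class="c" name=\'second\'></a>'
)
strict_names = extract_anchor_names(sample)
suggested_on_data_name = SUGGESTED.findall('<a data-name="x">')
```

This is still regex-based HTML matching, so unusual markup (for example an attribute value that itself contains ` name=`) can fool it; a real HTML parser would be the more robust choice if that matters.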

Comment on lines +87 to +102
def extract_headings(text: str) -> list[Heading]:
    """Extract only ATX headings that start with `#`. Setext headings are not handled."""
    stripped = strip_code(text)
    headings: list[Heading] = []
    for line in stripped.split("\n"):
        level = 0
        while level < len(line) and line[level] == "#":
            level += 1
        if level < 1 or level > 6:
            continue
        if level >= len(line):
            continue
        if line[level] not in (" ", "\t"):
            continue
        headings.append(Heading(level=level, text=line[level + 1 :].strip()))
    return headings

medium

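extract_headings does not account for leading whitespace (indentation) at the start of a line. According to the Markdown specification (CommonMark), up to three spaces are allowed before a heading, so it is better to handle this with lstrip(). The level check has also been tightened to prevent index-out-of-range errors.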

Suggested change
def extract_headings(text: str) -> list[Heading]:
    """Extract only ATX headings that start with `#`. Setext headings are not handled."""
    stripped = strip_code(text)
    headings: list[Heading] = []
    for line in stripped.split("\n"):
        line = line.lstrip()
        level = 0
        while level < len(line) and line[level] == "#":
            level += 1
        if level < 1 or level > 6:
            continue
        if level >= len(line) or line[level] not in (" ", "\t"):
            continue
        headings.append(Heading(level=level, text=line[level + 1 :].strip()))
    return headings
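
Note that lstrip() removes any amount of indentation, while CommonMark allows at most three leading spaces (four or more start an indented code block). A regex-based sketch that enforces the limit is shown below. `Heading` here is a stand-in NamedTuple, `extract_atx_headings` is a hypothetical name, and `strip_code` is omitted, so whether indented code already gets removed upstream is not modeled.

```python
import re
from typing import NamedTuple


class Heading(NamedTuple):
    """Stand-in for the project's Heading type (assumed shape)."""

    level: int
    text: str


# Up to 3 leading spaces, 1 to 6 `#`, then at least one space or tab.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})[ \t]+(.*)$")


def extract_atx_headings(text: str) -> list[Heading]:
    """Collect ATX headings, honoring the three-space indentation limit."""
    headings: list[Heading] = []
    for line in text.split("\n"):
        match = ATX_HEADING.match(line)
        if match:
            headings.append(Heading(len(match.group(1)), match.group(2).strip()))
    return headings


sample = "# Title\n   ## Indented\n    # code block\n####### seven\n#nospace"
found = extract_atx_headings(sample)
```

Only the first two lines of the sample count as headings: the four-space line is an indented code block, seven `#` characters exceed the maximum level, and `#nospace` lacks the required separator.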

Comment on lines +95 to +101
def _is_external_url(href: str) -> bool:
    lower = href.lower()
    return (
        lower.startswith("http://")
        or lower.startswith("https://")
        or lower.startswith("mailto:")
    )

medium

It would be good to extend _is_external_url to check additional external link schemes such as // (protocol-relative), javascript:, and tel:.

def _is_external_url(href: str) -> bool:
    lower = href.lower()
    return lower.startswith(("http://", "https://", "mailto:", "tel:", "javascript:", "//"))
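
An alternative to enumerating schemes is to treat any URI scheme as external. The sketch below uses the scheme grammar from RFC 3986 (a letter followed by letters, digits, `+`, `.`, or `-`, then `:`); the name `is_external_url` is hypothetical, and whether every scheme should count as external for this validator is an assumption.

```python
import re

# RFC 3986 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def is_external_url(href: str) -> bool:
    """Treat protocol-relative URLs and anything with a URI scheme as external."""
    href = href.strip()
    return href.startswith("//") or bool(_SCHEME.match(href))


checks = {
    "https://laravel.com": is_external_url("https://laravel.com"),
    "//cdn.example.com/x.js": is_external_url("//cdn.example.com/x.js"),
    "tel:+123": is_external_url("tel:+123"),
    "MAILTO:a@b.c": is_external_url("MAILTO:a@b.c"),
    "/docs/13.x/upgrade#upgrade-13.0": is_external_url("/docs/13.x/upgrade#upgrade-13.0"),
    "requests#input": is_external_url("requests#input"),
    "#top": is_external_url("#top"),
}
```

The explicit tuple in the suggestion is easier to audit; the scheme-based check avoids having to revisit the list when a new scheme shows up in the docs.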

html_path,
html_path.read_text(encoding="utf-8"),
)
if f'id="{anchor}"' in html:

medium

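The f'id="{anchor}"' in html check can match not only the ID attributes of actual HTML elements but also text inside code examples or comments, so false positives are possible. Using a regular expression to at least confirm that the match is an id="..." attribute would make the validation more accurate.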
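
Beyond a regular expression, one option is to collect ids with the standard library's html.parser, which only reports attributes of real start tags. The sketch below is an assumption about how that could look, not code from build_anchor_validator.py; `IdCollector` and `collect_ids` are hypothetical names.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect `id` attribute values from real start tags only."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def collect_ids(html: str) -> set[str]:
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


sample = (
    '<h2 id="upgrade-13.0">Upgrade</h2>'
    '<!-- <p id="commented-out">old</p> -->'
    "<script>var s = '<div id=\"in-script\">';</script>"
)
ids = collect_ids(sample)
substring_hit = 'id="commented-out"' in sample
```

Here the substring test reports a hit for the commented-out id, while the parser returns only the id of the actual heading. Parsing each page once and caching the id set would also avoid repeated substring scans over the same HTML.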

@augmentcode

augmentcode Bot commented May 2, 2026

🤖 Augment PR Summary

Summary: Refactors the repo so the Docusaurus site (Node) and the docs automation pipeline (Python) have a strict responsibility boundary.

Key changes:

  • Moves translation-structure validation and build-output processing (redirect generation + built-HTML anchor validation) into .github/docs-updater Python tools.
  • Updates update-docs.yml to run Python-only (uv + pytest + main.py) and commit only updater outputs.
  • Updates deploy.yml to run Node typecheck/build, then run Python redirect generation and built-site anchor validation post-build.
  • Simplifies package.json by removing build hooks and other automation scripts to avoid site→automation coupling.
  • Adjusts Playwright to run against the Docusaurus dev server; updates e2e assertions accordingly.
  • Adds Python test coverage for markdown link utils, structure validation, redirect generation, anchor validation, and boundary enforcement.

Technical notes: Introduces a boundary test to prevent npm scripts from invoking Python or .github/ automation, and documents the new workflow model in .ai-context/workflow.md.


@augmentcode augmentcode Bot left a comment

Review completed. 1 suggestion posted.

    if not url.startswith(prefix):
        return None
    end = url.find("/", len(prefix))
    return url[len(prefix) : end] if end >= 0 else None

@augmentcode augmentcode Bot May 2, 2026

docs_version_from_url() returns None for URLs like /docs/13.x (no trailing slash), which is exactly what to_url_path() produces for installation.md. That makes src_version None and skips {{version}} replacement / relative-link version prefixing, which can mis-resolve targets and cause incorrect anchor-validation failures.

Severity: high
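
One way to address this is to take the first path segment after the prefix whether or not a further slash follows. The sketch below is a possible fix under assumptions: the prefix value "/docs/" is inferred from the URLs in the comment, only versioned URLs (as produced by to_url_path) are expected as input, and stripping fragments and queries is an addition not present in the original function.

```python
from typing import Optional

DOCS_PREFIX = "/docs/"  # assumed value of `prefix` in the original function


def docs_version_from_url(url: str, prefix: str = DOCS_PREFIX) -> Optional[str]:
    """Return the version segment of a docs URL, or None if there is none."""
    if not url.startswith(prefix):
        return None
    rest = url[len(prefix):]
    # Drop any fragment or query before looking at path segments.
    for separator in ("#", "?"):
        rest = rest.split(separator, 1)[0]
    # The first segment is the version, with or without a trailing slash.
    version = rest.split("/", 1)[0]
    return version or None


results = [
    docs_version_from_url("/docs/13.x"),
    docs_version_from_url("/docs/13.x/"),
    docs_version_from_url("/docs/13.x/upgrade"),
    docs_version_from_url("/docs/13.x#server-requirements"),
    docs_version_from_url("/en/docs/13.x"),
    docs_version_from_url("/docs/"),
]
```

With this shape, the URL that to_url_path() emits for installation.md yields the same version as any deeper page, so the {{version}} replacement and relative-link prefixing are no longer skipped for that page.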

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR refactors the docs site and automation tooling by moving translation structure checks and build-artifact validations from Node scripts into a dedicated Python pipeline under .github/docs-updater, and simplifying the site’s npm scripts/e2e setup accordingly.

Changes:

  • Migrates translation structure validation, anchor validation, and redirect generation from scripts/*.mjs into Python modules with pytest coverage.
  • Simplifies package.json scripts and updates workflows so update-docs becomes Python-only while deploy runs Node build + Python post-processing/validation.
  • Adjusts Playwright to run against the dev server (npm run start) and updates e2e specs for the new responsibility split.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| scripts/validate-translation-structure.mjs | Removed (validation moved to Python). |
| scripts/validate-anchors.mjs | Removed (validation moved to Python). |
| scripts/sync-versioned-links.mjs | Removed (logic covered in Python main.py). |
| scripts/serve-build.mjs | Minor arg parsing rename for clarity. |
| scripts/markdown-link-utils.mjs | Removed (ported to Python utils). |
| scripts/find-missing-links.mjs | Removed (ported to Python CLI). |
| scripts/find-link-context.mjs | Removed (ported to Python CLI). |
| scripts/create-latest-doc-redirects.mjs | Removed (redirect generation moved to Python). |
| playwright.config.ts | Switches e2e webServer from static build to dev server. |
| package.json | Removes automation hooks; keeps Docusaurus/site-only scripts. |
| e2e/homepage.spec.ts | Adds wait to avoid hydration timing flakiness. |
| e2e/docs-rendering.spec.ts | Replaces hash-scroll assertion with rendered heading presence check. |
| e2e/build.spec.ts | Removed (build existence asserted elsewhere). |
| README.md | Updates workflow responsibility description to match refactor. |
| .github/workflows/update-docs.yml | Removes Node steps; runs Python pipeline only. |
| .github/workflows/deploy.yml | Adds uv/Python steps; runs Python redirect + anchor validation post-build. |
| .github/docs-updater/tests/test_structure_validator.py | New pytest coverage for structure validation parity. |
| .github/docs-updater/tests/test_project_boundaries.py | New test enforcing no Python/.github calls from npm scripts. |
| .github/docs-updater/tests/test_markdown_link_utils.py | New pytest coverage for markdown parsing utilities. |
| .github/docs-updater/tests/test_main.py | Adds regression tests for latest-stable API link handling. |
| .github/docs-updater/tests/test_build_redirect_generator.py | New pytest coverage for redirect generation output. |
| .github/docs-updater/tests/test_build_anchor_validator.py | New pytest coverage for built-anchor validation. |
| .github/docs-updater/structure_validator.py | New Python implementation of translation structure validation + reporting. |
| .github/docs-updater/markdown_link_utils.py | New Python markdown parsing utilities (ported from Node). |
| .github/docs-updater/main.py | Integrates structure validation + latest-stable sidebar behavior. |
| .github/docs-updater/find_missing_links.py | New Python CLI for missing/extra link debugging. |
| .github/docs-updater/find_link_context.py | New Python CLI for finding missing link context. |
| .github/docs-updater/build_redirect_generator.py | New Python build redirect generator for latest stable docs. |
| .github/docs-updater/build_anchor_validator.py | New Python validator for markdown anchors vs built HTML ids. |
| .github/docs-updater/.ai-context/workflow.md | Expanded documentation describing the new responsibility split. |

Comment on lines +61 to +67
def to_url_path(docs_root: Path, md_path: Path) -> str:
    parts = md_path.relative_to(docs_root).parts
    version = parts[0].removeprefix("version-")
    tail = "/".join(parts[1:])[:-3]
    if tail == "installation":
        return f"/docs/{version}"
    return f"/docs/{version}/{tail}"
    if not url.startswith(prefix):
        return None
    end = url.find("/", len(prefix))
    return url[len(prefix) : end] if end >= 0 else None
Comment on lines +8 to +13
def test_to_url_path_treats_installation_as_version_root(tmp_path: Path):
    docs_root = tmp_path / "versioned_docs"
    md = docs_root / "version-13.x" / "installation.md"
    md.parent.mkdir(parents=True)
    md.write_text("# Installation\n", encoding="utf-8")
    assert bav.to_url_path(docs_root, md) == "/docs/13.x"
Comment on lines +144 to +145
    if part.startswith("version-"):
        return part[len("version-") :]

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4dde103da

    if not url.startswith(prefix):
        return None
    end = url.find("/", len(prefix))
    return url[len(prefix) : end] if end >= 0 else None

P2: Handle version root URLs in docs version parser

docs_version_from_url returns None for URLs like /docs/13.x because it only succeeds when another / exists after the version segment. In this commit to_url_path now emits exactly /docs/<version> for installation.md, so any relative anchor link from that page (for example requests#...) is resolved without a version prefix and is reported as missing HTML even when build/docs/<version>/requests(.html|/index.html) exists. This creates false failures in deploy-time anchor validation for installation-page links.

@kimchanhyung98 kimchanhyung98 marked this pull request as draft May 2, 2026 07:16