Skip to content

COMP: Add patch_dynamic_description() to v4 ingest sanitizer#6221

Closed
hjmjohnson wants to merge 15 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson:update-6204-description-fix
Closed

COMP: Add patch_dynamic_description() to v4 ingest sanitizer#6221
hjmjohnson wants to merge 15 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson:update-6204-description-fix

Conversation

@hjmjohnson
Copy link
Copy Markdown
Member

COMP: Add patch_dynamic_description() to strip file(READ README.md) preamble

After ingest, the in-tree README.md is the archival migration notice.
It contains semicolons and [ characters that CMake list expansion splits
into spurious itk_module() arguments, producing configure warnings.
This patch adds automatic stripping of the dynamic-DESCRIPTION pattern
during history rewrite so future ingests never introduce it.

Companion to ITK PR #6220 which fixes the five already-ingested modules
(RLEImage, SplitComponents, IOFDF, IOMeshMZ3, IOMeshSTL) that exhibit
the warning today.

Changes
  • sanitize-history.py: add patch_dynamic_description() — strips the
    get_filename_component / file(READ README.md DOCUMENTATION) preamble
    and replaces DESCRIPTION "${DOCUMENTATION}" with
    DESCRIPTION "Module ingested from upstream." during cmake blob
    processing. Counter added to the run summary.

  • INGESTION_STRATEGY_v4.md: add a row to the "Patterns Phase A now
    sanitizes automatically" table documenting the pattern and linking to
    the five affected modules.

hjmjohnson and others added 15 commits May 4, 2026 13:31
Splits the v3 single-driver flow into two independent phases so the
upstream-archival publish step never runs on a working tree that
filter-repo has rewritten:

  - Phase A (ingest-module-v4.sh): local, throwaway clone; rewrites
    history into Modules/<Group>/<Module>; never touches the upstream
    remote.

  - Phase B (archive-remote-module.sh): fresh shallow clone; pushes
    one removal commit + promotes MIGRATION_README.md to README.md;
    flips archived=true.  Gated on the Phase A PR being merged.

The shared sanitize-history.py walks the rewritten branch and applies
the same hooks ITK's .pre-commit-config.yaml enforces:

  - clang-format on every C/C++ blob (content sniff)
  - black on every Python blob (content sniff)
  - gersemi on every CMake blob using ITK's .gersemi.config
  - trailing-whitespace, end-of-file-fixer, mixed-line-ending on
    every text-classified blob
  - VTK / SVG / hash-content sidecar files are skipped intact
  - Every commit subject without an ITK prefix gets one heuristically
    auto-prepended; decisions are logged for operator audit

filter-repo's blob_callback is filename-blind by design, so excludes
are approximated via content sniffing.  IOMeshSTL whitelist is the
first concrete example; smoke-tested via --dry-run successfully
(143->89 commits, 44->24 merges preserved, 108/149 blobs reformatted).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whitelist admits whole directories (test/, wrapping/, ...) but
some upstream remote modules place CI / packaging scaffolding inside
those (e.g. test/Docker/, .github/, azure-pipelines/).  Without a
deny-pass after the whitelist filter, those leak into ITK history.

Mirrors v3's --invert-paths --path-glob pass; discovered missing on
the first Cuberille v4 ingest (test/Docker/{Dockerfile,build,run,test}.sh
leaked).
Fixes the four classes of ghostflow errors that surfaced on the first
Cuberille v4 ingest (PR InsightSoftwareConsortium#6205) — the only allowable remaining error
should be the unavoidable upstream root commit.

  - Strip *.orig / *.rej / *.BACKUP.* / *.LOCAL.* / *.REMOTE.* /
    *.BASE.* merge-conflict-artifact files from history (catches
    leftover-conflict-marker errors like the .h.orig on Cuberille).
  - sanitize-history.py: cap every commit subject at 78 chars (ghostflow
    / kw-commit-msg rule); excess words move into the body.
  - sanitize-history.py: insert a blank line between subject and body
    when the original commit message lacked one (two-line subject error).
  - sanitize-history.py: clear the executable bit (mode 100755 -> 100644)
    on text-file extensions so root-commit "executable permissions but
    file does not look executable" stops firing.
Many ITK remote modules' itk-module.cmake reads
  file(READ "${MY_CURRENT_DIR}/README.rst" DOCUMENTATION)
to populate the CMake DESCRIPTION variable from the upstream README.

The v4 whitelist intentionally excludes the upstream README (the
operator ships a new README.md as part of the ingest PR), so without
this rewrite every commit's CMakeLists fails to configure post-ingest
until the operator manually patches itk-module.cmake.

sanitize-history.py now rewrites the file(READ ... README.rst ...)
reference to README.md as part of the cmake-blob transform, so every
historical commit on the rewritten branch is independently buildable.

Surfaced on the IOMeshSTL ingest (PR following PR InsightSoftwareConsortium#6204).
…terns

Add a 'Known artifacts at PR-review time' section to capture two
things future ingest operators need to expect but that v3/v4 do not
fix automatically:

  - The single ghostflow-check-main 'root commit not allowed' error
    that every Mode-A merge produces.  Maintainers override at merge.
    Calling it out keeps operators from chasing a non-fix.

  - A short table of code-level patterns Greptile has flagged
    post-ingest on multiple modules (IOMeshSTL InsightSoftwareConsortium#6206, RLEImage InsightSoftwareConsortium#6208):
    signed-vs-unsigned size types, missing override, dead return after
    itkExceptionMacro, stray test args, external-friend declarations,
    GetBuffer-const correctness, and allocator-init invariants.

The table is the durable channel for cross-ingest review knowledge —
extend it as new recurring patterns surface.
ExternalData stores resolved fetched content as .ExternalData_<algo>_<hash>
files alongside the .cid / .md5 sidecars they correspond to.  These
are local fetch cache, not source — ExternalData regenerates them on
demand at consumer-time.  Upstream remote modules sometimes commit
them inadvertently from a CTest run inside their working tree (e.g.
ITKIOMeshSTL had four such files in test/Baseline/, flagged by
@dzenanz on PR InsightSoftwareConsortium#6206).

Add **/.ExternalData_* to the scaffolding deny-pass so they never
enter ITK history.
Upstream remote modules wrap their top-level CMakeLists.txt in

    if(NOT ITK_SOURCE_DIR)
      find_package(ITK REQUIRED)
      list(APPEND CMAKE_MODULE_PATH ${ITK_CMAKE_DIR})
      include(ITKModuleExternal)
    else()
      itk_module_impl()
    endif()

so the same file works both as a fetched remote module and as a
stand-alone CMake project.  In an in-tree ingested module
ITK_SOURCE_DIR is always defined, so the if-branch is dead code;
@dzenanz flagged it on PR InsightSoftwareConsortium#6206 (IOMeshSTL).

Add patch_standalone_build_guard() that detects the idiom (allowing
2- or 4-space indent variants) and replaces it with a bare
itk_module_impl() before gersemi formatting runs.  Track count via
sanitizer.standalone_guard_patches and print it in the run summary.
The 'Greptile post-ingest patterns' table mixed mechanical artifacts
(strippable in Phase A) with semantic concerns (require human
judgment).  Split into two sections so the operator's checklist
shrinks as the sanitizer learns more rules.

Promote .ExternalData_* and the standalone-build CMake guard to the
auto-sanitized table now that the previous two commits handle them.
Upstream remote modules typically begin their top-level CMakeLists.txt
with their own cmake_minimum_required(VERSION X.Y.Z) declaration, often
3.10.2 or earlier.  ITK's top-level CMakeLists pins a higher minimum,
so the per-module line is redundant and frequently *lower* than ITK's
floor.  @dzenanz flagged it on every ingest with the comment 'Version
behind, not needed.' — most recently on PR InsightSoftwareConsortium#6215 (IOFDF).

Add patch_drop_cmake_minimum_required() that detects the idiom and
removes the matching line via a multiline regex.  Track count via
sanitizer.cmake_min_required_drops and print it in the run summary.
Apply alongside patch_standalone_build_guard() in the cmake-blob
branch of blob_callback.
…cmake

Archival README.md files contain semicolons and bracket characters.
CMake's foreach(arg \${ARGN}) splits on semicolons, so reading them
into a DESCRIPTION argument produces CMake (dev) configure warnings
for every module ingested via v4 (observed: RLEImage, SplitComponents,
IOFDF, IOMeshMZ3, IOMeshSTL).

Add patch_dynamic_description() to sanitize-history.py to strip the
get_filename_component/file(READ README.md) preamble and replace
DESCRIPTION "\${DOCUMENTATION}" with a static one-liner during history
rewrite.  Document the pattern in INGESTION_STRATEGY_v4.md.
@github-actions github-actions Bot added type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots area:Python wrapping Python bindings for a class labels May 6, 2026
@hjmjohnson
Copy link
Copy Markdown
Member Author

Closing — the single commit on this branch (396f1044 → cherry-picked as c7768a53) has been folded into PR #6204 directly. This PR should not have been opened as a separate one; the change belongs on ingest-strategy-v4.

PR #6204 head is now c7768a5377aaa16593aa1f6ffdcb38a764d700e3.

@hjmjohnson hjmjohnson closed this May 6, 2026
@hjmjohnson hjmjohnson deleted the update-6204-description-fix branch May 6, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Python wrapping Python bindings for a class type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant