COMP: Add patch_dynamic_description() to v4 ingest sanitizer#6221
Closed
hjmjohnson wants to merge 15 commits intoInsightSoftwareConsortium:mainfrom
Closed
COMP: Add patch_dynamic_description() to v4 ingest sanitizer#6221hjmjohnson wants to merge 15 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson wants to merge 15 commits intoInsightSoftwareConsortium:mainfrom
Conversation
Splits the v3 single-driver flow into two independent phases so the
upstream-archival publish step never runs on a working tree that
filter-repo has rewritten:
- Phase A (ingest-module-v4.sh): local, throwaway clone; rewrites
history into Modules/<Group>/<Module>; never touches the upstream
remote.
- Phase B (archive-remote-module.sh): fresh shallow clone; pushes
one removal commit + promotes MIGRATION_README.md to README.md;
flips archived=true. Gated on the Phase A PR being merged.
The shared sanitize-history.py walks the rewritten branch and applies
the same hooks ITK's .pre-commit-config.yaml enforces:
- clang-format on every C/C++ blob (content sniff)
- black on every Python blob (content sniff)
- gersemi on every CMake blob using ITK's .gersemi.config
- trailing-whitespace, end-of-file-fixer, mixed-line-ending on
every text-classified blob
- VTK / SVG / hash-content sidecar files are skipped intact
- Every commit subject without an ITK prefix gets one heuristically
auto-prepended; decisions are logged for operator audit
filter-repo's blob_callback is filename-blind by design, so excludes
are approximated via content sniffing. IOMeshSTL whitelist is the
first concrete example; smoke-tested via --dry-run successfully
(143->89 commits, 44->24 merges preserved, 108/149 blobs reformatted).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whitelist admits whole directories (test/, wrapping/, ...) but
some upstream remote modules place CI / packaging scaffolding inside
those (e.g. test/Docker/, .github/, azure-pipelines/). Without a
deny-pass after the whitelist filter, those leak into ITK history.
Mirrors v3's --invert-paths --path-glob pass; discovered missing on
the first Cuberille v4 ingest (test/Docker/{Dockerfile,build,run,test}.sh
leaked).
Fixes the four classes of ghostflow errors that surfaced on the first Cuberille v4 ingest (PR InsightSoftwareConsortium#6205) — the only allowable remaining error should be the unavoidable upstream root commit. - Strip *.orig / *.rej / *.BACKUP.* / *.LOCAL.* / *.REMOTE.* / *.BASE.* merge-conflict-artifact files from history (catches leftover-conflict-marker errors like the .h.orig on Cuberille). - sanitize-history.py: cap every commit subject at 78 chars (ghostflow / kw-commit-msg rule); excess words move into the body. - sanitize-history.py: insert a blank line between subject and body when the original commit message lacked one (two-line subject error). - sanitize-history.py: clear the executable bit (mode 100755 -> 100644) on text-file extensions so root-commit "executable permissions but file does not look executable" stops firing.
Many ITK remote modules' itk-module.cmake reads
file(READ "${MY_CURRENT_DIR}/README.rst" DOCUMENTATION)
to populate the CMake DESCRIPTION variable from the upstream README.
The v4 whitelist intentionally excludes the upstream README (the
operator ships a new README.md as part of the ingest PR), so without
this rewrite every commit's CMakeLists fails to configure post-ingest
until the operator manually patches itk-module.cmake.
sanitize-history.py now rewrites the file(READ ... README.rst ...)
reference to README.md as part of the cmake-blob transform, so every
historical commit on the rewritten branch is independently buildable.
Surfaced on the IOMeshSTL ingest (PR following PR InsightSoftwareConsortium#6204).
…terns
Add a 'Known artifacts at PR-review time' section to capture two
things future ingest operators need to expect but that v3/v4 do not
fix automatically:
- The single ghostflow-check-main 'root commit not allowed' error
that every Mode-A merge produces. Maintainers override at merge.
Calling it out keeps operators from chasing a non-fix.
- A short table of code-level patterns Greptile has flagged
post-ingest on multiple modules (IOMeshSTL InsightSoftwareConsortium#6206, RLEImage InsightSoftwareConsortium#6208):
signed-vs-unsigned size types, missing override, dead return after
itkExceptionMacro, stray test args, external-friend declarations,
GetBuffer-const correctness, and allocator-init invariants.
The table is the durable channel for cross-ingest review knowledge —
extend it as new recurring patterns surface.
ExternalData stores resolved fetched content as .ExternalData_<algo>_<hash> files alongside the .cid / .md5 sidecars they correspond to. These are local fetch cache, not source — ExternalData regenerates them on demand at consumer-time. Upstream remote modules sometimes commit them inadvertently from a CTest run inside their working tree (e.g. ITKIOMeshSTL had four such files in test/Baseline/, flagged by @dzenanz on PR InsightSoftwareConsortium#6206). Add **/.ExternalData_* to the scaffolding deny-pass so they never enter ITK history.
Upstream remote modules wrap their top-level CMakeLists.txt in
if(NOT ITK_SOURCE_DIR)
find_package(ITK REQUIRED)
list(APPEND CMAKE_MODULE_PATH ${ITK_CMAKE_DIR})
include(ITKModuleExternal)
else()
itk_module_impl()
endif()
so the same file works both as a fetched remote module and as a
stand-alone CMake project. In an in-tree ingested module
ITK_SOURCE_DIR is always defined, so the if-branch is dead code;
@dzenanz flagged it on PR InsightSoftwareConsortium#6206 (IOMeshSTL).
Add patch_standalone_build_guard() that detects the idiom (allowing
2- or 4-space indent variants) and replaces it with a bare
itk_module_impl() before gersemi formatting runs. Track count via
sanitizer.standalone_guard_patches and print it in the run summary.
The 'Greptile post-ingest patterns' table mixed mechanical artifacts (strippable in Phase A) with semantic concerns (require human judgment). Split into two sections so the operator's checklist shrinks as the sanitizer learns more rules. Promote .ExternalData_* and the standalone-build CMake guard to the auto-sanitized table now that the previous two commits handle them.
Upstream remote modules typically begin their top-level CMakeLists.txt with their own cmake_minimum_required(VERSION X.Y.Z) declaration, often 3.10.2 or earlier. ITK's top-level CMakeLists pins a higher minimum, so the per-module line is redundant and frequently *lower* than ITK's floor. @dzenanz flagged it on every ingest with the comment 'Version behind, not needed.' — most recently on PR InsightSoftwareConsortium#6215 (IOFDF). Add patch_drop_cmake_minimum_required() that detects the idiom and removes the matching line via a multiline regex. Track count via sanitizer.cmake_min_required_drops and print it in the run summary. Apply alongside patch_standalone_build_guard() in the cmake-blob branch of blob_callback.
…cmake
Archival README.md files contain semicolons and bracket characters.
CMake's foreach(arg \${ARGN}) splits on semicolons, so reading them
into a DESCRIPTION argument produces CMake (dev) configure warnings
for every module ingested via v4 (observed: RLEImage, SplitComponents,
IOFDF, IOMeshMZ3, IOMeshSTL).
Add patch_dynamic_description() to sanitize-history.py to strip the
get_filename_component/file(READ README.md) preamble and replace
DESCRIPTION "\${DOCUMENTATION}" with a static one-liner during history
rewrite. Document the pattern in INGESTION_STRATEGY_v4.md.
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
COMP: Add patch_dynamic_description() to strip file(READ README.md) preamble
After ingest, the in-tree
README.mdis the archival migration notice.It contains semicolons and
[characters that CMake list expansion splitsinto spurious
itk_module()arguments, producing configure warnings.This patch adds automatic stripping of the dynamic-DESCRIPTION pattern
during history rewrite so future ingests never introduce it.
Companion to ITK PR #6220 which fixes the five already-ingested modules
(RLEImage, SplitComponents, IOFDF, IOMeshMZ3, IOMeshSTL) that exhibit
the warning today.
Changes
sanitize-history.py: addpatch_dynamic_description()— strips theget_filename_component / file(READ README.md DOCUMENTATION)preambleand replaces
DESCRIPTION "${DOCUMENTATION}"withDESCRIPTION "Module ingested from upstream."during cmake blobprocessing. Counter added to the run summary.
INGESTION_STRATEGY_v4.md: add a row to the "Patterns Phase A nowsanitizes automatically" table documenting the pattern and linking to
the five affected modules.