Skip to content

Add feature tests and manifest schema for archive (tar/zip) support (#23)#1058

Merged
spoorcc merged 35 commits intomainfrom
claude/implement-issue-23-fmipS
Mar 22, 2026
Merged

Add feature tests and manifest schema for archive (tar/zip) support (#23)#1058
spoorcc merged 35 commits intomainfrom
claude/implement-issue-23-fmipS

Conversation

@spoorcc
Copy link
Copy Markdown
Contributor

@spoorcc spoorcc commented Mar 20, 2026

Introduces manifest schema changes for fetching
dependencies from tar/zip archives as described in issue #23.

Summary by CodeRabbit

  • New Features

    • First-class support for archive dependencies (tar/zip) with safe download, extraction, and optional integrity fingerprints (sha256/sha384/sha512).
  • Commands

    • Freeze now pins archives with integrity hashes; Check and Update include Archive scenarios; Update computes ignored files dynamically.
  • Reporting

    • SBOM/reporting treats archives as distribution references and emits hashes when available.
  • Documentation

    • Manifest schema/docs updated for vcs: archive and integrity.hash.
  • Bug Fixes

    • Improved path-traversal protections and deterministic directory hashing.
  • Tests

    • Extensive new unit and feature tests and test helpers for archive workflows.

Summary by CodeRabbit

  • New Features

    • Add archive dependency support with optional integrity fingerprints (sha256/sha384/sha512).
  • Commands

    • freeze records archive integrity; check/update include Archive scenarios; update/patch use dynamic ignored-file callbacks.
  • Reporting

    • SBOM/reporting treats archives as distribution references and surfaces integrity hashes when present.
  • Documentation

    • Manifest schema and docs updated to document vcs: archive and integrity.hash format.
  • Bug Fixes

    • Improved path-traversal protections and deterministic directory hashing.
  • Tests

    • Added extensive unit and feature tests and archive test helpers.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 22ee6de7-7626-4ebf-b85e-398daca35bf7

📥 Commits

Reviewing files that changed from the base of the PR and between b16855f and 59e59b2.

📒 Files selected for processing (3)
  • doc/_ext/sphinxcontrib_asciinema/.dfetch_data.yaml
  • doc/generate-casts/demo-magic/.dfetch_data.yaml
  • example/dfetch.yaml
✅ Files skipped from review due to trivial changes (3)
  • doc/generate-casts/demo-magic/.dfetch_data.yaml
  • doc/_ext/sphinxcontrib_asciinema/.dfetch_data.yaml
  • example/dfetch.yaml

Walkthrough

Adds archive-backed VCS support: manifest/schema integrity.hash, archive download/verify/extract utilities, ArchiveSubProject with fetch/freeze, PURL/SBOM mapping for archives, ignored-files callback propagation, unified path-traversal checks, extensive tests and BDD feature scenarios.

Changes

Cohort / File(s) Summary
VCS backend
dfetch/vcs/archive.py
New archive utilities: archive URL detection, reachable/downloadable remotes, streamed downloads with optional hash computation, and secure extraction with path-traversal and decompression-bomb protections.
Archive subproject & SubProject API
dfetch/project/archivesubproject.py, dfetch/project/subproject.py, dfetch/project/__init__.py
Added ArchiveSubProject (fetch, verify, freeze_project); refactored SubProject.update to accept ignored_files_callback; added SubProject.freeze_project; registered archive in supported types.
Manifest model & schema
dfetch/manifest/project.py, dfetch/manifest/schema.py, dfetch/manifest/manifest.py, doc/manifest.rst
Introduced Integrity dataclass; ProjectEntry.integrity/hash accessors; schema allows vcs: archive and optional integrity.hash validated by HASH_STR; widened typing and YAML serialization.
Integrity/hash utilities
dfetch/vcs/integrity_hash.py, tests/test_integrity_hash.py
New IntegrityHash class and supported algorithms parsing/matching with unit tests.
Archive PURL & SBOM mapping
dfetch/util/purl.py, dfetch/reporting/sbom_reporter.py
Added vcs_url_to_purl and archive_url_to_purl; SBOM reporter maps archives to distribution externalReferences and emits CycloneDX hashes when integrity is available; component enrichment refactored into helpers.
Filesystem, copy & license helpers
dfetch/util/util.py, dfetch/project/gitsubproject.py, dfetch/project/svnsubproject.py
Added LICENSE_GLOBS/is_license_file, copy/prune/src-subset helpers, check_no_path_traversal, deterministic hashing utilities; git/svn now use shared license helper.
Commands & path-safety / ignored-files
dfetch/commands/freeze.py, dfetch/commands/check.py, dfetch/commands/update.py, dfetch/commands/update_patch.py, dfetch/commands/format_patch.py, dfetch/commands/report.py
Docs add Archive scenarios; freeze delegates to sub_project.freeze_project(); update/check/report integrate archive/integrity flows; ignored-files now passed as callbacks; path-containment checks replaced with check_no_path_traversal.
Superproject / ignored_files call sites
dfetch/project/gitsuperproject.py, dfetch/project/superproject.py, dfetch/project/svnsuperproject.py
Replaced local is_relative_to checks with check_no_path_traversal for path containment validation.
BDD features & step helpers
features/*.feature, features/steps/*.py
New feature specs and step implementations for archive fetch/freeze/check/report; archive creation helpers, manifest/JSON placeholder substitutions, and JSON-subset assertions.
Tests
tests/*
tests/test_archive.py, tests/test_integrity.py, tests/test_purl.py, tests/test_integrity_hash.py, tests/test_subproject.py, tests/test_util.py, tests/test_update.py, tests/test_patch.py
Added and updated tests covering archive utilities, Integrity/ProjectEntry, PURL conversion, extraction safety, hashing behavior, update/freeze flows, and related test adjustments.
Examples & docs/misc
example/dfetch.yaml, doc/manifest.rst, CHANGELOG.rst, .github/workflows/release.yml
Example manifest adds an archive entry with integrity.hash; docs/schema updated for archive/integrity; changelog updated; minor CI whitespace fix.
Logging
dfetch/log.py
Applied Rich markup escaping to user/external-provided text and adjusted external logger configuration to avoid markup leakage.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Client/CLI
    participant Freeze as Freeze.__call__()
    participant Sub as ArchiveSubProject
    participant Local as ArchiveLocalRepo

    CLI->>Freeze: freeze(project)
    Freeze->>Sub: on_disk_version()
    Sub->>Local: inspect extracted tree / metadata
    Local-->>Sub: on-disk Version or ""
    alt already pinned
        Sub-->>Freeze: None
        Freeze-->>CLI: "Already pinned…" or warning
    else need pin
        Sub->>Sub: compute sha256(on-disk content)
        Sub->>Sub: update project.integrity.hash
        Sub-->>Freeze: "sha256:<hex>"
        Freeze-->>CLI: "Frozen on version sha256:<hex>"
    end
Loading
sequenceDiagram
    participant CLI as Client/CLI
    participant Fetch as ArchiveSubProject.fetch()
    participant Remote as ArchiveRemote
    participant Local as ArchiveLocalRepo

    CLI->>Fetch: fetch(version)
    Fetch->>Remote: download(tempfile)
    Remote-->>Fetch: tempfile path / streamed bytes
    alt integrity.hash provided
        Fetch->>Fetch: compute hash(tempfile)
        Fetch->>Fetch: verify computed == expected
        alt mismatch
            Fetch-->>CLI: raise RuntimeError (hash mismatch)
        end
    end
    Fetch->>Local: extract(tempfile, dest, src, ignore)
    Local-->>Fetch: extracted files
    Fetch->>Fetch: cleanup tempfile
    Fetch-->>CLI: return Version
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement, testing, bug, documentation

Suggested reviewers

  • ben-edna
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding archive (tar/zip) support with corresponding feature tests and manifest schema updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/implement-issue-23-fmipS

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (9)
dfetch/project/subproject.py (1)

395-396: Docstring references outdated manifest field name.

This should mention integrity.hash instead of hash: to match current archive manifest structure.

Proposed wording tweak
-        the hash in the ``hash:`` field rather than ``revision:``).
+        the hash in the ``integrity.hash`` field rather than ``revision:``).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/subproject.py` around lines 395 - 396, Update the docstring
that references the outdated manifest field name by replacing the old "hash:"
mention with the current "integrity.hash" field; locate the docstring text
associated with ArchiveSubProject (reference: ArchiveSubProject / archive
subproject docstring in dfetch/project/subproject.py) and adjust the wording so
it says "integrity.hash" instead of "hash:" to match the current archive
manifest structure.
dfetch/commands/report.py (1)

111-114: Docstring still uses legacy hash: wording.

Please rename this to integrity.hash so terminology matches the current manifest schema and avoids confusion.

Proposed wording tweak
-        metadata is present yet, the ``hash:`` field from the manifest is used
+        metadata is present yet, the ``integrity.hash`` field from the manifest is
+        used
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/commands/report.py` around lines 111 - 114, The docstring in
report-related text refers to the old manifest field name "hash:"—update the
wording to use the current manifest schema key "integrity.hash" so the paragraph
in report.py that currently reads 'the ``hash:`` field from the manifest is used
as fallback' becomes 'the ``integrity.hash`` field from the manifest is used as
fallback'; ensure the rest of the sentence about archive projects, the metadata
revision (sha256:<hex>) usage, and fallback behavior in the same docstring
(around the block that mentions "For archive projects the sha256 hash...")
remain unchanged.
features/fetch-archive.feature (2)

86-106: Assert the failure semantics on hash mismatch.

This only checks the message. A regression that logs the mismatch but still exits 0 or leaves a partially extracted SomeProject/ tree behind would still pass; please add a return-code and filesystem assertion here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/fetch-archive.feature` around lines 86 - 106, The scenario currently
only asserts the error message; update the feature to also assert the process
exits non-zero and that no partially extracted directory remains: after running
"dfetch update" for the manifest with integrity mismatch, assert the command's
exit code is non-zero (non-0) and assert the filesystem does not contain a
"SomeProject/" directory or any extracted files (e.g., no README.md in the
workspace). Reference the scenario steps using the same manifest/SomeProject
names and the "dfetch update" invocation to implement the exit-code check and
the filesystem absence assertion.

164-184: Make the --force scenario prove the archive was refreshed.

Right now the second update reuses the same remote archive and only asserts the log line. If dfetch update --force skipped re-extraction but still printed Fetched, this scenario would still pass. Mutating the archive between the two updates and asserting the local file content changes would make this cover the force path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/fetch-archive.feature` around lines 164 - 184, Update the "Archive
is re-fetched when force flag is given" scenario to mutate the remote archive
between the first update and the forced update so the test verifies
re-extraction: after the first "And all projects are updated in MyProject" run,
change the archived README.md content (or replace the archive file used by the
Given an archive "SomeProject.tar.gz" step), then run "When I run \"dfetch
update --force\" in MyProject" and assert the local README.md (or other file
from SomeProject) now contains the new content instead of only checking the
"Fetched" log line; ensure the scenario references the same archive name and the
final assertion inspects the extracted file under MyProject to prove the
refresh.
features/steps/generic_steps.py (1)

246-259: Route output assertions through _apply_context_substitutions() too.

check_output() duplicates the archive placeholder handling but only replaces <archive-sha256>. Reusing the shared helper here keeps <archive-url> and any future placeholders consistent between text and JSON assertions.

Patch sketch
-    expected_raw = context.text
-    if hasattr(context, "archive_sha256"):
-        expected_raw = expected_raw.replace("<archive-sha256>", context.archive_sha256)
+    expected_raw = _apply_context_substitutions(context.text, context)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/generic_steps.py` around lines 246 - 259, The block building
expected_text should use the shared helper _apply_context_substitutions()
instead of manually handling only "<archive-sha256>" so all placeholders (e.g.
"<archive-url>", future placeholders) are applied consistently; replace the
manual hasattr/context.archive_sha256 replacement and passing of expected_raw
into multisub with a call to _apply_context_substitutions(context, context.text)
(or apply it to expected_raw before multisub) and then feed that substituted
text into multisub to produce expected_text, keeping the multisub call (patterns
list) and using the same symbols (expected_raw, expected_text, multisub,
_apply_context_substitutions) to locate and update the code.
features/check-archive.feature (1)

95-117: Add an ignored-file variant for the local-changes path.

This scenario only edits README.md, so it never exercises archive + ignore: handling. A regression where ignored files are checked with the fetched project's VCS would still pass here; I'd add a case that mutates an ignored path and verifies dfetch check stays clean. Based on learnings: In dfetch, when determining ignored files for a project, the VCS detection should use the superproject's VCS type (where the manifest is located), not the fetched project's VCS type.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/check-archive.feature` around lines 95 - 117, Add a new scenario
variant in features/check-archive.feature that mutates a path marked with
ignore: (e.g., add an ignore: entry for "ignored.txt" in the dfetch.yaml for
SomeProject and then modify SomeProject/ignored.txt locally) and assert that
running "dfetch check SomeProject" reports no local changes; this ensures
archive + ignore handling is exercised. Also update the VCS-detection logic so
that when computing ignored files for a fetched project it uses the
superproject’s VCS type (the VCS where dfetch.yaml lives) rather than the
fetched project's VCS — apply this change in the code path that determines VCS
for ignore processing so dfetch check and ignore: behave correctly for archive
projects.
dfetch/project/archivesubproject.py (1)

73-78: Minor: Redundant storage of project entry.

self._project_entry is stored separately, but the base SubProject class already stores the project entry as self._manifest. Consider using the inherited attribute to avoid duplication.

♻️ Suggested simplification

You could use self._manifest (inherited from SubProject) instead of maintaining a separate self._project_entry reference. However, if the explicit naming improves readability in this context, this is a minor nit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 73 - 78,
ArchiveSubProject.__init__ currently assigns the project to self._project_entry
even though SubProject already stores it as self._manifest; remove the redundant
self._project_entry = project assignment and update any internal references in
ArchiveSubProject (and methods that might reference _project_entry) to use the
inherited self._manifest instead, leaving the existing
ArchiveRemote(project.remote_url) construction as-is (or pass
self._manifest.remote_url if you prefer consistency).
dfetch/util/purl.py (1)

39-45: Consider consolidating duplicate archive extension constants.

_ARCHIVE_EXTENSIONS duplicates the ARCHIVE_EXTENSIONS constant already defined in dfetch/vcs/archive.py (line 44). This creates a maintenance burden if the supported extensions change.

♻️ Suggested refactor
+from dfetch.vcs.archive import ARCHIVE_EXTENSIONS as _ARCHIVE_EXTENSIONS
+
 # Map from dfetch hash-field algorithm prefix to CycloneDX HashAlgorithm name
 DFETCH_TO_CDX_HASH_ALGORITHM: dict[str, str] = {
     "sha256": "SHA-256",
 }
-
-# Archive file extensions recognised as downloadable archive artifacts
-_ARCHIVE_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/util/purl.py` around lines 39 - 45, Remove the duplicated
_ARCHIVE_EXTENSIONS constant and reuse the single source of truth
(ARCHIVE_EXTENSIONS) already defined elsewhere: delete _ARCHIVE_EXTENSIONS from
purl.py, add an import for ARCHIVE_EXTENSIONS from the module that defines it,
and update any local references to use ARCHIVE_EXTENSIONS (or an aliased name if
needed) so supported archive extensions remain maintained in one place; leave
DFETCH_TO_CDX_HASH_ALGORITHM unchanged.
features/steps/archive_steps.py (1)

45-53: Potential issue with URL construction on Windows.

Line 52 splits the path by os.sep and rejoins with /, but context.remotes_dir_path could have mixed separators on Windows. Consider using pathlib.Path for more robust path handling.

♻️ Suggested improvement
 def _archive_url(context, filename: str) -> str:
     """Build the archive URL in the same format used by apply_manifest_substitutions.
 
     apply_manifest_substitutions produces ``file:///`` + absolute path, which for an
     absolute path like ``/tmp/...`` yields four slashes (``file:////tmp/...``).
     We must match that format so placeholder substitution works in SBOM assertions.
     """
-    server_fwd = "/".join(context.remotes_dir_path.split(os.sep))
+    server_fwd = pathlib.Path(context.remotes_dir_path).as_posix()
     return f"file:///{server_fwd}/{filename}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/archive_steps.py` around lines 45 - 53, The URL construction
in _archive_url uses split/join on os.sep which can fail with mixed separators
on Windows; replace the server_fwd computation to use pathlib.Path for stable
POSIX-style paths (e.g., compute server_fwd from
pathlib.Path(context.remotes_dir_path).as_posix()) so backslashes are normalized
and drive-letter/absolute path formats are preserved before forming the final
f"file:///{server_fwd}/{filename}" return value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/commands/freeze.py`:
- Around line 88-105: The freeze flow is collapsing two different failure modes
into False; update freeze_project (in dfetch.project.create_sub_project's class,
e.g., the ArchiveSubProject implementation) to return a distinguishable result
(e.g., (success: bool, error: Optional[Exception]|str) or an enum) or raise on
download/hash failures, and adjust the caller in freeze.py to: call
sub_project.freeze_project(project), detect a download/hash failure vs "already
pinned", log a clear warning/error when _download_and_compute_hash fails (do not
claim "Already pinned"), and after a successful freeze re-query
sub_project.on_disk_version() (or use a returned written_version) to print the
actual sha256: value that was written instead of the pre-freeze URL.
- Around line 39-42: Update the documentation string in
dfetch/commands/freeze.py so it reflects the new manifest shape: state that for
archive projects `dfetch freeze` adds the hash under the nested key
`integrity.hash` (e.g. `integrity.hash: sha256:<hex>`) rather than a top-level
`hash` field; locate the explanatory text currently describing the archive pin
(the lines mentioning "adds a `hash: sha256:<hex>` field") and change it to
describe the nested `integrity.hash` key and its purpose so the doc matches
validation and the manifest produced by the freeze command.

In `@dfetch/manifest/schema.py`:
- Line 18: The current HASH_STR regex allows any-length hex after "sha256:";
update the HASH_STR constant in schema.py to require exactly 64 hex characters
(i.e., match a full SHA-256 digest) by changing the pattern to enforce {64} hex
chars while preserving the "sha256:" prefix and case handling; modify the Regex
for HASH_STR accordingly so manifest validation only accepts valid 64-character
SHA-256 pins.

In `@dfetch/vcs/archive.py`:
- Around line 251-259: The call to tarfile.extractall in ArchiveLocalRepo (lines
around ArchiveLocalRepo._check_tar_members and tarfile.open) uses the
filter="tar" kw only available in Python 3.11.4+ and will raise TypeError on
supported 3.10/early-3.11; change the extraction to pass filter conditionally
(e.g. detect sys.version_info >= (3,11,4) or try/except TypeError) and only
include the filter arg when supported, ensuring sys is imported and behavior for
zip extraction (zipfile.ZipFile.extractall and
ArchiveLocalRepo._check_zip_members) remains unchanged.

---

Nitpick comments:
In `@dfetch/commands/report.py`:
- Around line 111-114: The docstring in report-related text refers to the old
manifest field name "hash:"—update the wording to use the current manifest
schema key "integrity.hash" so the paragraph in report.py that currently reads
'the ``hash:`` field from the manifest is used as fallback' becomes 'the
``integrity.hash`` field from the manifest is used as fallback'; ensure the rest
of the sentence about archive projects, the metadata revision (sha256:<hex>)
usage, and fallback behavior in the same docstring (around the block that
mentions "For archive projects the sha256 hash...") remain unchanged.

In `@dfetch/project/archivesubproject.py`:
- Around line 73-78: ArchiveSubProject.__init__ currently assigns the project to
self._project_entry even though SubProject already stores it as self._manifest;
remove the redundant self._project_entry = project assignment and update any
internal references in ArchiveSubProject (and methods that might reference
_project_entry) to use the inherited self._manifest instead, leaving the
existing ArchiveRemote(project.remote_url) construction as-is (or pass
self._manifest.remote_url if you prefer consistency).

In `@dfetch/project/subproject.py`:
- Around line 395-396: Update the docstring that references the outdated
manifest field name by replacing the old "hash:" mention with the current
"integrity.hash" field; locate the docstring text associated with
ArchiveSubProject (reference: ArchiveSubProject / archive subproject docstring
in dfetch/project/subproject.py) and adjust the wording so it says
"integrity.hash" instead of "hash:" to match the current archive manifest
structure.

In `@dfetch/util/purl.py`:
- Around line 39-45: Remove the duplicated _ARCHIVE_EXTENSIONS constant and
reuse the single source of truth (ARCHIVE_EXTENSIONS) already defined elsewhere:
delete _ARCHIVE_EXTENSIONS from purl.py, add an import for ARCHIVE_EXTENSIONS
from the module that defines it, and update any local references to use
ARCHIVE_EXTENSIONS (or an aliased name if needed) so supported archive
extensions remain maintained in one place; leave DFETCH_TO_CDX_HASH_ALGORITHM
unchanged.

In `@features/check-archive.feature`:
- Around line 95-117: Add a new scenario variant in
features/check-archive.feature that mutates a path marked with ignore: (e.g.,
add an ignore: entry for "ignored.txt" in the dfetch.yaml for SomeProject and
then modify SomeProject/ignored.txt locally) and assert that running "dfetch
check SomeProject" reports no local changes; this ensures archive + ignore
handling is exercised. Also update the VCS-detection logic so that when
computing ignored files for a fetched project it uses the superproject’s VCS
type (the VCS where dfetch.yaml lives) rather than the fetched project's VCS —
apply this change in the code path that determines VCS for ignore processing so
dfetch check and ignore: behave correctly for archive projects.

In `@features/fetch-archive.feature`:
- Around line 86-106: The scenario currently only asserts the error message;
update the feature to also assert the process exits non-zero and that no
partially extracted directory remains: after running "dfetch update" for the
manifest with integrity mismatch, assert the command's exit code is non-zero
(non-0) and assert the filesystem does not contain a "SomeProject/" directory or
any extracted files (e.g., no README.md in the workspace). Reference the
scenario steps using the same manifest/SomeProject names and the "dfetch update"
invocation to implement the exit-code check and the filesystem absence
assertion.
- Around line 164-184: Update the "Archive is re-fetched when force flag is
given" scenario to mutate the remote archive between the first update and the
forced update so the test verifies re-extraction: after the first "And all
projects are updated in MyProject" run, change the archived README.md content
(or replace the archive file used by the Given an archive "SomeProject.tar.gz"
step), then run "When I run \"dfetch update --force\" in MyProject" and assert
the local README.md (or other file from SomeProject) now contains the new
content instead of only checking the "Fetched" log line; ensure the scenario
references the same archive name and the final assertion inspects the extracted
file under MyProject to prove the refresh.

In `@features/steps/archive_steps.py`:
- Around line 45-53: The URL construction in _archive_url uses split/join on
os.sep which can fail with mixed separators on Windows; replace the server_fwd
computation to use pathlib.Path for stable POSIX-style paths (e.g., compute
server_fwd from pathlib.Path(context.remotes_dir_path).as_posix()) so
backslashes are normalized and drive-letter/absolute path formats are preserved
before forming the final f"file:///{server_fwd}/{filename}" return value.

In `@features/steps/generic_steps.py`:
- Around line 246-259: The block building expected_text should use the shared
helper _apply_context_substitutions() instead of manually handling only
"<archive-sha256>" so all placeholders (e.g. "<archive-url>", future
placeholders) are applied consistently; replace the manual
hasattr/context.archive_sha256 replacement and passing of expected_raw into
multisub with a call to _apply_context_substitutions(context, context.text) (or
apply it to expected_raw before multisub) and then feed that substituted text
into multisub to produce expected_text, keeping the multisub call (patterns
list) and using the same symbols (expected_raw, expected_text, multisub,
_apply_context_substitutions) to locate and update the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c1df770-4e1c-4c4a-b70b-e115c836d6fb

📥 Commits

Reviewing files that changed from the base of the PR and between 843034a and 502f1ad.

📒 Files selected for processing (21)
  • dfetch/commands/check.py
  • dfetch/commands/freeze.py
  • dfetch/commands/report.py
  • dfetch/commands/update.py
  • dfetch/manifest/project.py
  • dfetch/manifest/schema.py
  • dfetch/project/__init__.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/subproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/vcs/archive.py
  • doc/manifest.rst
  • features/check-archive.feature
  • features/fetch-archive.feature
  • features/freeze-archive.feature
  • features/report-sbom-archive.feature
  • features/steps/archive_steps.py
  • features/steps/generic_steps.py
  • features/steps/manifest_steps.py
  • features/validate-manifest.feature

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
dfetch/vcs/archive.py (2)

103-119: Potential issue: HEAD request may not work for all archive URLs.

Some servers (especially file hosting services and CDNs) reject HEAD requests or return different status codes than for GET. Additionally, file:// URLs with method="HEAD" may behave unexpectedly.

💡 Consider a fallback or documentation note

For more robust accessibility checks, consider:

  1. Falling back to a GET request with Range: bytes=0-0 header if HEAD fails
  2. Or documenting that is_accessible() is best-effort and may return false negatives

This is optional since the current implementation handles exceptions gracefully by returning False.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 103 - 119, The is_accessible() method
currently only issues a HEAD request which some servers reject; update
ArchiveRemote.is_accessible to try a HEAD first and on failure (catching
urllib.error.URLError or ValueError) retry with a lightweight GET using the
header "Range: bytes=0-0" (or equivalent) to avoid downloading the whole
archive, and ensure file:// URLs are handled by testing path existence directly
(os.path.exists) instead of using urllib; keep existing timeout and exception
handling and return False on any final failure.

229-239: Consider adding path traversal validation for TAR archives on older Python.

_check_tar_members validates member count and total size but doesn't check for path traversal attacks (absolute paths or .. components). The filter="tar" parameter handles this on Python ≥3.11.4, but on older versions (3.10, 3.11.0–3.11.3) the extraction at line 263 proceeds without path sanitization.

For consistency with the ZIP validation in _check_zip_members, consider adding explicit path checks:

♻️ Suggested path validation for TAR members
 `@staticmethod`
 def _check_tar_members(tf: tarfile.TarFile) -> None:
     """Validate TAR member count and total size against decompression bombs.

     Raises:
-        RuntimeError: When the archive exceeds the size/count limits.
+        RuntimeError: When the archive exceeds the size/count limits or
+            contains unsafe member paths (on Python < 3.11.4).
     """
     members = tf.getmembers()
     ArchiveLocalRepo._check_archive_limits(
         len(members), sum(m.size for m in members if m.isfile())
     )
+    # On older Python without filter="tar", manually reject unsafe paths
+    if sys.version_info < (3, 11, 4):
+        for m in members:
+            member_path = pathlib.PurePosixPath(m.name)
+            if member_path.is_absolute() or any(
+                part == ".." for part in member_path.parts
+            ):
+                raise RuntimeError(
+                    f"Archive contains an unsafe member path: {m.name!r}"
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 229 - 239, The TAR member check in
_check_tar_members currently only enforces count/size via tf.getmembers() and
ArchiveLocalRepo._check_archive_limits; add path traversal validation like in
_check_zip_members by iterating members from tf.getmembers(), rejecting any
member with absolute paths or any '..' path component (e.g., using os.path.isabs
and normalizing with os.path.normpath to detect parent traversal), and raise
RuntimeError if such unsafe paths are found before extraction; keep the existing
size/count checks and ensure the same exception/messages semantics as
ArchiveLocalRepo._check_archive_limits so extraction code that follows will not
process unsafe members.
dfetch/project/archivesubproject.py (2)

130-148: Verify the cost of _does_revision_exist for hash-pinned archives.

When revision is a hash (e.g., sha256:<hex>), this method downloads the entire archive and computes its hash just to check existence. This could be expensive for large archives during dfetch check.

This might be intentional for integrity verification, but consider documenting this behavior or adding a lighter-weight check option.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 130 - 148, The
_does_revision_exist method currently downloads the entire archive when revision
is hash-pinned (loop over SUPPORTED_HASH_ALGORITHMS and call to
_download_and_compute_hash) which is expensive; either document this behavior in
the method docstring or add an option to perform a lighter existence check (for
example add a parameter like perform_integrity_check: bool defaulting to True)
and when False skip downloading and instead return a quicker response (e.g. call
_remote_repo.is_accessible() or a HEAD-only check), and ensure callers of
_does_revision_exist (and tests) are updated to pass the flag where the
lightweight check is desired; keep the existing integrity path using
_download_and_compute_hash when the full verification is required.

50-58: Importing private functions from sibling module.

_safe_compare_hex and _suffix_for_url are prefixed with _ indicating they're private to dfetch.vcs.archive. While acceptable for internal use within the same package, consider either:

  1. Removing the _ prefix to make them part of the public API
  2. Or adding a comment noting the intentional cross-module usage
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 50 - 58, The import of
private symbols _safe_compare_hex and _suffix_for_url from dfetch.vcs.archive
should be marked as intentional: in archivesubproject.py add a short comment
above the import line stating these are private helpers being imported
deliberately for internal package use (or alternatively rename the functions in
dfetch.vcs.archive to safe_compare_hex and suffix_for_url if you want them
public); reference the symbols _safe_compare_hex and _suffix_for_url and ensure
the comment explains the reason for cross-module access and that they are stable
for internal use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 201-208: The code unpacks expected_hash with
expected_hash.split(":", 1) which can raise ValueError if the string lacks a
colon; update the logic in the block where expected_hash is handled (around
compute_hash, _safe_compare_hex, and self._project_entry.name) to validate or
parse safely: either check if ":" in expected_hash before splitting and raise a
clear RuntimeError/ValueError with context if malformed, or use a try/except
around the split to catch ValueError and raise a descriptive error about the
malformed expected_hash for this project entry; ensure the rest of the flow
(calling compute_hash and _safe_compare_hex) only runs when you have both
algorithm and expected_hex.

---

Nitpick comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 130-148: The _does_revision_exist method currently downloads the
entire archive when revision is hash-pinned (loop over SUPPORTED_HASH_ALGORITHMS
and call to _download_and_compute_hash) which is expensive; either document this
behavior in the method docstring or add an option to perform a lighter existence
check (for example add a parameter like perform_integrity_check: bool defaulting
to True) and when False skip downloading and instead return a quicker response
(e.g. call _remote_repo.is_accessible() or a HEAD-only check), and ensure
callers of _does_revision_exist (and tests) are updated to pass the flag where
the lightweight check is desired; keep the existing integrity path using
_download_and_compute_hash when the full verification is required.
- Around line 50-58: The import of private symbols _safe_compare_hex and
_suffix_for_url from dfetch.vcs.archive should be marked as intentional: in
archivesubproject.py add a short comment above the import line stating these are
private helpers being imported deliberately for internal package use (or
alternatively rename the functions in dfetch.vcs.archive to safe_compare_hex and
suffix_for_url if you want them public); reference the symbols _safe_compare_hex
and _suffix_for_url and ensure the comment explains the reason for cross-module
access and that they are stable for internal use.

In `@dfetch/vcs/archive.py`:
- Around line 103-119: The is_accessible() method currently only issues a HEAD
request which some servers reject; update ArchiveRemote.is_accessible to try a
HEAD first and on failure (catching urllib.error.URLError or ValueError) retry
with a lightweight GET using the header "Range: bytes=0-0" (or equivalent) to
avoid downloading the whole archive, and ensure file:// URLs are handled by
testing path existence directly (os.path.exists) instead of using urllib; keep
existing timeout and exception handling and return False on any final failure.
- Around line 229-239: The TAR member check in _check_tar_members currently only
enforces count/size via tf.getmembers() and
ArchiveLocalRepo._check_archive_limits; add path traversal validation like in
_check_zip_members by iterating members from tf.getmembers(), rejecting any
member with absolute paths or any '..' path component (e.g., using os.path.isabs
and normalizing with os.path.normpath to detect parent traversal), and raise
RuntimeError if such unsafe paths are found before extraction; keep the existing
size/count checks and ensure the same exception/messages semantics as
ArchiveLocalRepo._check_archive_limits so extraction code that follows will not
process unsafe members.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2fcb8335-169b-4f2d-84b7-c652531ebcff

📥 Commits

Reviewing files that changed from the base of the PR and between 502f1ad and a9b2fff.

📒 Files selected for processing (8)
  • dfetch/commands/freeze.py
  • dfetch/commands/report.py
  • dfetch/manifest/schema.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/subproject.py
  • dfetch/util/purl.py
  • dfetch/vcs/archive.py
  • features/steps/generic_steps.py
✅ Files skipped from review due to trivial changes (1)
  • dfetch/commands/report.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • dfetch/manifest/schema.py
  • features/steps/generic_steps.py
  • dfetch/project/subproject.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 107-122: The code is using self._project_entry.remote_url instead
of the specific revision/manifest URL being inspected or frozen; update the
logic so _download_and_compute_hash, _does_revision_exist, and freeze_project
use the revision-specific URL (the URL recorded in on-disk metadata or the
revision argument) rather than the current manifest URL on the project entry.
Specifically, modify _does_revision_exist to honor its revision parameter
(resolve the URL for that revision instead of ignoring it), change
freeze_project to read the stored manifest/remote URL for the revision being
frozen (not self._project_entry.remote_url), and make _download_and_compute_hash
accept/derive the revision-specific URL or path so download(tmp_path) and
compute_hash operate on the exact archive referenced by the revision metadata.

In `@dfetch/vcs/archive.py`:
- Around line 306-328: The _copy_with_src function currently silently does
nothing when the requested src path is missing; update it to fail fast by
checking whether src_path exists before proceeding (use os.path.exists(src_path)
or not (os.path.isdir(src_path) or os.path.isfile(src_path))) and raise a clear
exception (e.g., ValueError or a custom error) that includes the requested src
and extract_root so the caller can surface a manifest/layout error; keep the
existing license-copying logic but only run it after validating src_path exists.
- Around line 247-295: Update the Python-version gate in _extract_raw to use
sys.version_info >= (3, 11, 5) (since filter="tar" was added in 3.11.5) and
harden ArchiveLocalRepo._check_tar_members to also reject unsafe
symlink/hardlink/device entries: for each tar member (in _check_tar_members) in
addition to validating member.name, check member.issym() and member.islnk() and
validate member.linkname for absolute paths or '..' components (reject if
unsafe), and reject device nodes by checking member.type against tarfile.CHRTYPE
and tarfile.BLKTYPE (or using any available tarfile API to detect
character/block devices); raise RuntimeError for any such unsafe member so
extraction on older Python releases cannot introduce unsafe links or device
files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 32ba4587-207a-4dd2-8783-bf10ab859bb8

📥 Commits

Reviewing files that changed from the base of the PR and between a9b2fff and 8c9d0b2.

📒 Files selected for processing (3)
  • dfetch/project/archivesubproject.py
  • dfetch/vcs/archive.py
  • features/steps/archive_steps.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
dfetch/project/archivesubproject.py (1)

112-127: ⚠️ Potential issue | 🟠 Major

Freeze can hash a different archive than the one on disk.

Line 265 reads the fetched revision, but Line 276 still hashes self._project_entry.remote_url via _download_and_compute_hash(). If the manifest URL was edited after the last fetch, freeze will pin the new URL’s content instead of the archive recorded in metadata.

🛠️ Proposed fix
-    def _download_and_compute_hash(self, algorithm: str = "sha256") -> str:
+    def _download_and_compute_hash(
+        self, url: str, algorithm: str = "sha256"
+    ) -> str:
         """Download the archive to a temporary file and return its hash.
@@
         try:
             with tempfile.NamedTemporaryFile(
-                suffix=_suffix_for_url(self._project_entry.remote_url), delete=False
+                suffix=_suffix_for_url(url), delete=False
             ) as tmp:
                 tmp_path = tmp.name
-            self._remote_repo.download(tmp_path)
+            ArchiveRemote(url).download(tmp_path)
             return compute_hash(tmp_path, algorithm)
@@
                 expected_hex = revision.split(":", 1)[1]
                 try:
-                    actual = self._download_and_compute_hash(algo)
+                    actual = self._download_and_compute_hash(
+                        self._project_entry.remote_url, algo
+                    )
                     return _safe_compare_hex(actual, expected_hex)
                 except RuntimeError:
                     return False
@@
-        hex_value = self._download_and_compute_hash("sha256")
+        hex_value = self._download_and_compute_hash(revision, "sha256")

Also applies to: 145-150, 265-276

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 112 - 127, The hashing
routine currently always uses self._project_entry.remote_url which can differ
from the URL recorded when the project was last fetched; update
_download_and_compute_hash (and its callers) to use the URL saved in the project
metadata (the stored "fetched" URL/revision on the ProjectEntry) rather than the
live remote_url so we download and hash the exact archived file recorded in
metadata; specifically, change _download_and_compute_hash to accept (or read)
the recorded URL field (e.g. fetched_revision/fetched_url on ProjectEntry) and
use that value for _suffix_for_url(...) and for self._remote_repo.download(...)
and update any call sites that relied on the old signature to pass the stored
URL.
🧹 Nitpick comments (1)
tests/test_purl.py (1)

184-187: Test correctly verifies version parameter handling.

The test confirms that the version argument is correctly plumbed through to purl.version.

Consider adding a quick sanity assertion to ensure the archive detection still works correctly when version is provided:

💡 Optional: Add minimal sanity check
 def test_archive_purl_with_version():
     url = "https://example.com/lib-1.0.tar.gz"
     purl = remote_url_to_purl(url, version="sha256:" + "a" * 64)
+    assert purl.type == "generic"  # sanity check: still detected as archive
     assert purl.version == "sha256:" + "a" * 64
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_purl.py` around lines 184 - 187, The test currently asserts that
remote_url_to_purl(url, version=...) preserves the version but doesn't verify
archive detection; update test_archive_purl_with_version to also assert the PURL
identifies the input as an archive (e.g., check purl.type == "archive" or
another archive-indicating attribute used by remote_url_to_purl) so the
archive-detection path is exercised when a version is supplied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_archive.py`:
- Around line 246-259: Replace platform-specific string formatting in
test_is_accessible_existing_file and test_is_accessible_missing_file with
pathlib.Path.as_uri(): create the temp file path, convert it to a file URI via
pathlib.Path(path).as_uri() and pass that to ArchiveRemote so URIs are
RFC-compliant across OSes; for the missing-file test, build a
guaranteed-nonexistent path inside a temporary directory (e.g., create a tempdir
and append a filename that is not created) and convert that Path to a URI with
as_uri() before calling ArchiveRemote.is_accessible().

---

Duplicate comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 112-127: The hashing routine currently always uses
self._project_entry.remote_url which can differ from the URL recorded when the
project was last fetched; update _download_and_compute_hash (and its callers) to
use the URL saved in the project metadata (the stored "fetched" URL/revision on
the ProjectEntry) rather than the live remote_url so we download and hash the
exact archived file recorded in metadata; specifically, change
_download_and_compute_hash to accept (or read) the recorded URL field (e.g.
fetched_revision/fetched_url on ProjectEntry) and use that value for
_suffix_for_url(...) and for self._remote_repo.download(...) and update any call
sites that relied on the old signature to pass the stored URL.

---

Nitpick comments:
In `@tests/test_purl.py`:
- Around line 184-187: The test currently asserts that remote_url_to_purl(url,
version=...) preserves the version but doesn't verify archive detection; update
test_archive_purl_with_version to also assert the PURL identifies the input as
an archive (e.g., check purl.type == "archive" or another archive-indicating
attribute used by remote_url_to_purl) so the archive-detection path is exercised
when a version is supplied.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: de6e17bb-c6f0-4665-a9f3-5c2312dc7628

📥 Commits

Reviewing files that changed from the base of the PR and between 8c9d0b2 and c8ced39.

📒 Files selected for processing (11)
  • .github/workflows/release.yml
  • dfetch/manifest/manifest.py
  • dfetch/manifest/project.py
  • dfetch/project/__init__.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/subproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/vcs/archive.py
  • tests/test_archive.py
  • tests/test_integrity.py
  • tests/test_purl.py
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/release.yml
🚧 Files skipped from review as they are similar to previous changes (4)
  • dfetch/project/init.py
  • dfetch/vcs/archive.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/manifest/project.py

@spoorcc spoorcc force-pushed the claude/implement-issue-23-fmipS branch from c8ced39 to e997e3f Compare March 20, 2026 22:42
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dfetch/reporting/sbom_reporter.py (1)

255-307: ⚠️ Potential issue | 🟡 Minor

Preserve component.group for GitHub and Bitbucket PURLs.

This helper now only assigns group in the archive/generic branches, so GitHub/Bitbucket components lose their org/workspace grouping after the refactor.

Proposed fix
     def _apply_external_references(
         component: Component, purl: PackageURL, version: str
     ) -> None:
         """Attach external references to *component* based on its PURL type."""
+        if purl.namespace:
+            component.group = purl.namespace
+
         if purl.type == "github":
             component.external_references.add(
                 ExternalReference(
                     type=ExternalReferenceType.VCS,
                     url=XsUri(f"https://github.com/{purl.namespace}/{purl.name}"),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/reporting/sbom_reporter.py` around lines 255 - 307, The
GitHub/Bitbucket branches in _apply_external_references no longer set
component.group, causing loss of org/workspace grouping; update those branches
to set component.group = purl.namespace or None (same as _apply_archive_refs)
before adding the ExternalReference so GitHub and Bitbucket components retain
their group; touched symbols: SbomReporter._apply_external_references,
SbomReporter._apply_archive_refs, component.group, purl.namespace.
🧹 Nitpick comments (1)
features/steps/generic_steps.py (1)

116-122: List subset matching allows duplicates in expected to match single actual items.

The current logic checks that each expected item has some match in actual, but doesn't ensure uniqueness. For example, expected=[{"id": 1}, {"id": 1}] would match actual=[{"id": 1}].

This is likely acceptable for BDD test assertions where you're checking "these items exist" rather than exact cardinality. If stricter matching is ever needed, you'd need to track which actual items have been matched.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/generic_steps.py` around lines 116 - 122, The list-handling
branch in _json_subset_matches currently allows multiple expected items to match
the same actual item; to enforce unique matching, replace the any/all approach
with a match-tracking loop: when isinstance(expected, list) (and actual is a
list) iterate over each exp_item and for each find the first act_item that is
not already matched and for which _json_subset_matches(exp_item, act_item)
returns True, mark that act_item as matched (e.g., by index in a boolean list or
removing from a working copy of actual) and continue; if any exp_item finds no
unmatched match return False, otherwise return True. Ensure you reference
_json_subset_matches, expected and actual in the updated logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/project/subproject.py`:
- Around line 405-412: The code currently compares whole Version objects and
treats any branch differences as needing a rewrite; instead, consider versions
equal if their pinned identity (tag and revision) match. Replace the full-object
equality check between project.version and on_disk_version with a comparison of
their tag and revision attributes (e.g. if project.version.tag ==
on_disk_version.tag and project.version.revision == on_disk_version.revision:
return None) so that extra branch data on on_disk_version does not trigger
refreezing; keep the subsequent assignment to project.version only when the
tag/revision differ.

In `@dfetch/util/util.py`:
- Around line 44-63: The current copy_src_subset implementation can be tricked
by absolute paths or path traversal in src because it joins src_root and src
without validating the result; change copy_src_subset to resolve and normalize
both src_root and the joined path (use os.path.realpath or pathlib.Path.resolve)
into absolute paths, then verify the resolved src path is inside the resolved
src_root (e.g., with os.path.commonpath([resolved_src_root, resolved_src_path])
== resolved_src_root); if the check fails, raise RuntimeError before calling
copy_directory_contents or shutil.copy2; ensure you reference and replace uses
of src_path with the validated/resolved path and keep existing behavior for
directory vs file copies.

---

Outside diff comments:
In `@dfetch/reporting/sbom_reporter.py`:
- Around line 255-307: The GitHub/Bitbucket branches in
_apply_external_references no longer set component.group, causing loss of
org/workspace grouping; update those branches to set component.group =
purl.namespace or None (same as _apply_archive_refs) before adding the
ExternalReference so GitHub and Bitbucket components retain their group; touched
symbols: SbomReporter._apply_external_references,
SbomReporter._apply_archive_refs, component.group, purl.namespace.

---

Nitpick comments:
In `@features/steps/generic_steps.py`:
- Around line 116-122: The list-handling branch in _json_subset_matches
currently allows multiple expected items to match the same actual item; to
enforce unique matching, replace the any/all approach with a match-tracking
loop: when isinstance(expected, list) (and actual is a list) iterate over each
exp_item and for each find the first act_item that is not already matched and
for which _json_subset_matches(exp_item, act_item) returns True, mark that
act_item as matched (e.g., by index in a boolean list or removing from a working
copy of actual) and continue; if any exp_item finds no unmatched match return
False, otherwise return True. Ensure you reference _json_subset_matches,
expected and actual in the updated logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7f109bb1-60bb-4336-9166-b771f351e189

📥 Commits

Reviewing files that changed from the base of the PR and between c8ced39 and e997e3f.

📒 Files selected for processing (17)
  • .github/workflows/release.yml
  • dfetch/commands/report.py
  • dfetch/manifest/manifest.py
  • dfetch/manifest/project.py
  • dfetch/project/__init__.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/gitsubproject.py
  • dfetch/project/subproject.py
  • dfetch/project/svnsubproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • features/steps/generic_steps.py
  • tests/test_archive.py
  • tests/test_integrity.py
  • tests/test_patch.py
✅ Files skipped from review due to trivial changes (4)
  • tests/test_patch.py
  • .github/workflows/release.yml
  • tests/test_integrity.py
  • tests/test_archive.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • dfetch/project/init.py
  • dfetch/manifest/project.py

@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Mar 20, 2026

@coderabbitai resume

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 20, 2026

✅ Actions performed

Reviews resumed.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
dfetch/project/archivesubproject.py (1)

217-222: The is_license parameter defaults to True in ArchiveLocalRepo.extract() - verify this is the desired behavior.

Looking at ArchiveLocalRepo.extract() in archive.py, the is_license parameter defaults to True. The call here doesn't pass it explicitly, which means license files will always be retained when src is specified. This is likely correct, but worth confirming aligns with the intended behavior for archives.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 217 - 222, The call to
ArchiveLocalRepo.extract(tmp_path, self.local_path, src=self.source,
ignore=self.ignore) relies on ArchiveLocalRepo.extract (in archive.py)
defaulting is_license=True, which will always retain license files when src is
provided; confirm if that is intended, and if not, explicitly pass
is_license=False in this call (or update the default in ArchiveLocalRepo.extract
if the library-wide behavior should change) so archive extraction behavior
matches the intended license-file handling.
dfetch/commands/report.py (1)

117-122: Use the same fallback chain after a metadata read.

project.hash is only consulted when the metadata file is missing. If the file exists but tag/revision is empty, archive reports still collapse to "" even though the manifest already has usable values.

♻️ Suggested tweak
         try:
             metadata = Metadata.from_file(Metadata.from_project_entry(project).path)
-            version = metadata.tag or metadata.revision or ""
+            version = (
+                metadata.tag
+                or metadata.revision
+                or project.tag
+                or project.revision
+                or project.hash
+                or ""
+            )
         except FileNotFoundError:
             version = project.tag or project.revision or project.hash or ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/commands/report.py` around lines 117 - 122, The code currently sets
version = metadata.tag or metadata.revision or "" when a metadata file exists,
which ignores manifest fallbacks; update the assignment so after reading
metadata (via Metadata.from_file and Metadata.from_project_entry(project).path)
you fall back to the project's values in the same chain — e.g., use metadata.tag
or metadata.revision or project.tag or project.revision or project.hash or "" —
so that when metadata fields are empty the manifest's tag/revision/hash are
consulted before returning an empty string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/reporting/sbom_reporter.py`:
- Around line 251-253: The code is misusing component.version to carry the
archive integrity hash which causes hashes to be dropped when version is a
normal revision; update _apply_archive_refs (and the related code paths around
where _apply_external_references, _apply_licenses, and self._bom.components.add
are called) to accept and set an explicit archive_hash (e.g., integrity_hash)
instead of overloading component.version: populate component.version with the
real semantic revision (e.g., "1.2.3") and add the CycloneDX hash into
component.hashes (or a dedicated component.extensions/integrity field) using the
algorithm and hex value; ensure the logic that previously checked for version
like "algo:hex" now checks the explicit archive_hash parameter and always adds a
Hash object to component.hashes so both version and integrity are preserved.

---

Nitpick comments:
In `@dfetch/commands/report.py`:
- Around line 117-122: The code currently sets version = metadata.tag or
metadata.revision or "" when a metadata file exists, which ignores manifest
fallbacks; update the assignment so after reading metadata (via
Metadata.from_file and Metadata.from_project_entry(project).path) you fall back
to the project's values in the same chain — e.g., use metadata.tag or
metadata.revision or project.tag or project.revision or project.hash or "" — so
that when metadata fields are empty the manifest's tag/revision/hash are
consulted before returning an empty string.

In `@dfetch/project/archivesubproject.py`:
- Around line 217-222: The call to ArchiveLocalRepo.extract(tmp_path,
self.local_path, src=self.source, ignore=self.ignore) relies on
ArchiveLocalRepo.extract (in archive.py) defaulting is_license=True, which will
always retain license files when src is provided; confirm if that is intended,
and if not, explicitly pass is_license=False in this call (or update the default
in ArchiveLocalRepo.extract if the library-wide behavior should change) so
archive extraction behavior matches the intended license-file handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f81dd25-1908-45cd-9c83-893f9ac0406d

📥 Commits

Reviewing files that changed from the base of the PR and between c8ced39 and e997e3f.

📒 Files selected for processing (17)
  • .github/workflows/release.yml
  • dfetch/commands/report.py
  • dfetch/manifest/manifest.py
  • dfetch/manifest/project.py
  • dfetch/project/__init__.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/gitsubproject.py
  • dfetch/project/subproject.py
  • dfetch/project/svnsubproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • features/steps/generic_steps.py
  • tests/test_archive.py
  • tests/test_integrity.py
  • tests/test_patch.py
✅ Files skipped from review due to trivial changes (6)
  • tests/test_patch.py
  • .github/workflows/release.yml
  • dfetch/project/svnsubproject.py
  • tests/test_integrity.py
  • dfetch/manifest/manifest.py
  • tests/test_archive.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • dfetch/project/gitsubproject.py
  • dfetch/project/init.py
  • dfetch/project/subproject.py
  • features/steps/generic_steps.py
  • dfetch/util/util.py
  • dfetch/util/purl.py
  • dfetch/manifest/project.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
dfetch/vcs/archive.py (1)

289-313: ⚠️ Potential issue | 🔴 Critical

Harden TAR extraction fallback and avoid brittle version gating for filter= support.

Line [333] uses a fixed version tuple to decide filter="tar", and Line [336] falls back to unfiltered extraction. _check_tar_members (Line [306]-[312]) only validates member paths, not symlink/hardlink targets or device members. On runtimes where filter support is absent, this leaves a real extraction safety gap.

For CPython tarfile: which versions support TarFile.extractall(filter=...), and what exactly does filter="tar" validate for symlink/hardlink/device entries?
🛠️ Proposed fix
+import inspect
@@
     def _check_tar_members(tf: tarfile.TarFile) -> None:
@@
         for member in members:
             member_path = pathlib.PurePosixPath(member.name)
             if member_path.is_absolute() or any(
                 part == ".." for part in member_path.parts
             ):
                 raise RuntimeError(
                     f"Archive contains an unsafe member path: {member.name!r}"
                 )
+            if member.islnk() or member.issym():
+                target = pathlib.PurePosixPath(member.linkname)
+                if target.is_absolute() or any(part == ".." for part in target.parts):
+                    raise RuntimeError(
+                        f"Archive contains an unsafe link target: {member.linkname!r}"
+                    )
+            if member.isdev():
+                raise RuntimeError(
+                    f"Archive contains unsupported device member: {member.name!r}"
+                )
@@
+    `@staticmethod`
+    def _supports_tar_filter() -> bool:
+        return "filter" in inspect.signature(tarfile.TarFile.extractall).parameters
+
     `@staticmethod`
     def _extract_raw(archive_path: str, dest_dir: str) -> None:
@@
-                if sys.version_info >= (3, 11, 4):
+                if ArchiveLocalRepo._supports_tar_filter():
                     tf.extractall(dest_dir, filter="tar")  # nosec B202
                 else:
                     tf.extractall(dest_dir)  # nosec B202

Also applies to: 333-336

dfetch/project/archivesubproject.py (1)

112-130: ⚠️ Potential issue | 🟠 Major

Honor the inspected/frozen revision URL instead of always using the current manifest URL.

Line [132] ignores revision, and Line [233]-[235] recomputes a hash via self._remote_repo (bound to current project.remote_url). If the manifest URL changes after a prior fetch, check/freeze can validate or pin a different archive than the one represented on disk.

🛠️ Proposed fix
-    def _download_and_compute_hash(self, algorithm: str = "sha256") -> IntegrityHash:
+    def _download_and_compute_hash(
+        self, url: str, algorithm: str = "sha256"
+    ) -> IntegrityHash:
@@
-        fd, tmp_path = tempfile.mkstemp(suffix=_suffix_for_url(self.remote))
+        fd, tmp_path = tempfile.mkstemp(suffix=_suffix_for_url(url))
         os.close(fd)
         try:
-            hex_digest = self._remote_repo.download(tmp_path, algorithm=algorithm)
+            hex_digest = ArchiveRemote(url).download(tmp_path, algorithm=algorithm)
             return IntegrityHash(algorithm, hex_digest)
@@
-    def _does_revision_exist(self, revision: str) -> bool:  # noqa: ARG002
+    def _does_revision_exist(self, revision: str) -> bool:
@@
-        return self._remote_repo.is_accessible()
+        expected = IntegrityHash.parse(revision)
+        if expected:
+            return self._remote_repo.is_accessible()
+        return ArchiveRemote(revision).is_accessible()
@@
-        pinned = IntegrityHash.parse(revision) or self._download_and_compute_hash(
-            "sha256"
-        )
+        pinned = IntegrityHash.parse(revision) or self._download_and_compute_hash(
+            revision, "sha256"
+        )

Also applies to: 132-140, 230-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 112 - 130,
_download_and_compute_hash and the places that call into
self._remote_repo.download currently always use the repo bound to the project's
current remote_url, ignoring the inspected/frozen revision; change the logic so
the download is performed against the specific inspected/frozen revision URL (or
a RemoteRepo instance bound to that revision) instead of self._remote_repo:
locate the download call(s) (calls to self._remote_repo.download) and replace
them with a download against the revision-specific endpoint (e.g. obtain or
construct a RemoteRepo/remote client for self.revision or pass the
revision-specific URL into the download call), then compute and return
IntegrityHash as before; ensure all callers that recompute the hash use this
revision-bound download path rather than the ambient project remote_url.
🧹 Nitpick comments (1)
dfetch/util/util.py (1)

57-61: Path traversal protection is in place, but commonpath can raise on Windows.

The os.path.commonpath check correctly prevents src from escaping src_root. However, on Windows, commonpath raises ValueError when paths are on different drives (e.g., C:\ vs D:\). While this edge case is unlikely in practice (archive extraction typically stays on one drive), it could cause an unhandled exception rather than a clear error.

🛡️ Optional: Handle ValueError for clearer error messaging
     resolved_src_root = os.path.realpath(src_root)
     src_path = os.path.join(src_root, src)
     resolved_src_path = os.path.realpath(src_path)
-    if os.path.commonpath([resolved_src_root, resolved_src_path]) != resolved_src_root:
+    try:
+        within_root = os.path.commonpath([resolved_src_root, resolved_src_path]) == resolved_src_root
+    except ValueError:
+        within_root = False
+    if not within_root:
         raise RuntimeError(f"src {src!r} escapes the source root")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/util/util.py` around lines 57 - 61, Wrap the os.path.commonpath check
that compares resolved_src_root and resolved_src_path in a try/except for
ValueError (thrown on Windows when paths are on different drives) inside the
same block where resolved_src_root/resolved_src_path are computed; if ValueError
occurs, raise a RuntimeError like the existing one (e.g., "src ... escapes the
source root") so the function (using variables resolved_src_root,
resolved_src_path, commonpath, src_root, src) produces a clear, handled error
instead of an unhandled ValueError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/vcs/archive.py`:
- Around line 100-115: _is_http_reachable currently treats 3xx as reachable
while _http_download only accepts status 200, so update _http_download to follow
redirects: when _http_download (and its helper _http_conn) receives a 3xx
status, read the Location header, resolve it against the original URL (use
urllib.parse.urljoin to handle relative locations), and retry the request
against the new URL; implement this as a loop or tail-recursive call with a max
redirect limit (e.g., 5) to avoid infinite loops and carry forward necessary
headers, and ensure errors are raised if Location is missing or max redirects
exceeded; keep _is_http_reachable behavior unchanged or tighten it to consider
3xx unreachable if you prefer, but the primary fix is adding redirect handling
in _http_download.

---

Duplicate comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 112-130: _download_and_compute_hash and the places that call into
self._remote_repo.download currently always use the repo bound to the project's
current remote_url, ignoring the inspected/frozen revision; change the logic so
the download is performed against the specific inspected/frozen revision URL (or
a RemoteRepo instance bound to that revision) instead of self._remote_repo:
locate the download call(s) (calls to self._remote_repo.download) and replace
them with a download against the revision-specific endpoint (e.g. obtain or
construct a RemoteRepo/remote client for self.revision or pass the
revision-specific URL into the download call), then compute and return
IntegrityHash as before; ensure all callers that recompute the hash use this
revision-bound download path rather than the ambient project remote_url.

---

Nitpick comments:
In `@dfetch/util/util.py`:
- Around line 57-61: Wrap the os.path.commonpath check that compares
resolved_src_root and resolved_src_path in a try/except for ValueError (thrown
on Windows when paths are on different drives) inside the same block where
resolved_src_root/resolved_src_path are computed; if ValueError occurs, raise a
RuntimeError like the existing one (e.g., "src ... escapes the source root") so
the function (using variables resolved_src_root, resolved_src_path, commonpath,
src_root, src) produces a clear, handled error instead of an unhandled
ValueError.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf974a2-99f1-4590-bc1a-76b7673a054d

📥 Commits

Reviewing files that changed from the base of the PR and between af242fd and 0b737fb.

📒 Files selected for processing (12)
  • dfetch/project/archivesubproject.py
  • dfetch/project/subproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • dfetch/vcs/integrity_hash.py
  • features/steps/generic_steps.py
  • tests/test_archive.py
  • tests/test_integrity_hash.py
  • tests/test_subproject.py
  • tests/test_util.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • features/steps/generic_steps.py
  • dfetch/util/purl.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
dfetch/util/util.py (1)

57-61: ⚠️ Potential issue | 🟠 Major

Convert commonpath() drive mismatches into your RuntimeError.

Line 60 still raises ValueError on Windows when src is absolute on another drive. That bypasses the intended “escapes the source root” path and won’t be caught by catch_runtime_exceptions(), which only handles RuntimeError.

Suggested fix
     resolved_src_root = os.path.realpath(src_root)
     src_path = os.path.join(src_root, src)
     resolved_src_path = os.path.realpath(src_path)
-    if os.path.commonpath([resolved_src_root, resolved_src_path]) != resolved_src_root:
+    try:
+        within_root = (
+            os.path.commonpath([resolved_src_root, resolved_src_path])
+            == resolved_src_root
+        )
+    except ValueError:
+        within_root = False
+    if not within_root:
         raise RuntimeError(f"src {src!r} escapes the source root")

You can verify the stdlib behavior with:

#!/bin/bash
python - <<'PY'
import ntpath

root = ntpath.realpath(r"C:\extracted")
outside = ntpath.realpath(r"D:\outside")

try:
    print(ntpath.commonpath([root, outside]))
except Exception as exc:
    print(type(exc).__name__, exc)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/util/util.py` around lines 57 - 61, The code compares
resolved_src_root and resolved_src_path using os.path.commonpath which on
Windows can raise ValueError for paths on different drives, bypassing the
intended RuntimeError; wrap the commonpath check in a try/except ValueError (or
handle exceptions from os.path.commonpath) inside the function where
resolved_src_root/resolved_src_path are computed and, if a ValueError occurs or
the commonpath check fails, raise the same RuntimeError(f"src {src!r} escapes
the source root") so drive-mismatch cases are normalized into the existing
RuntimeError path (locate the block that computes resolved_src_root, src_path,
resolved_src_path and the if os.path.commonpath(...) check).
dfetch/project/subproject.py (1)

418-429: ⚠️ Potential issue | 🟠 Major

Honor revision_is_enough() before treating the manifest as already frozen.

This early return now ignores branch unconditionally. For VCSes where branch is part of identity, branch=A, revision=123 and branch=B, revision=123 are not equivalent, so freeze can skip rewriting a stale manifest entry.

Suggested fix
         on_disk_version = self.on_disk_version()
-        if (
-            on_disk_version
-            and project.version.tag == on_disk_version.tag
-            and project.version.revision == on_disk_version.revision
-        ):
-            return None
-        if on_disk_version:
-            project.version = on_disk_version
-            return (
-                on_disk_version.revision or on_disk_version.tag or str(on_disk_version)
-            )
-        return None
+        if not on_disk_version:
+            return None
+
+        current_version = project.version
+        if current_version.tag and current_version.tag == on_disk_version.tag:
+            return None
+        if (
+            current_version.revision
+            and self.revision_is_enough()
+            and current_version.revision == on_disk_version.revision
+        ):
+            return None
+        if current_version == on_disk_version:
+            return None
+
+        project.version = on_disk_version
+        return (
+            on_disk_version.revision or on_disk_version.tag or str(on_disk_version)
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/subproject.py` around lines 418 - 429, The early return that
treats the on-disk manifest as identical should respect the version's
revision_is_enough() policy: when you compare on_disk_version and
project.version (the block using on_disk_version = self.on_disk_version() and
the equality check on tag/revision), only return None if the versions match AND
the version's revision_is_enough() indicates the revision alone is sufficient;
otherwise fall through so the manifest can be rewritten. Update the condition to
call revision_is_enough() on the appropriate Version object (e.g.,
on_disk_version.revision_is_enough() or project.version.revision_is_enough())
and require it true before skipping the update.
🧹 Nitpick comments (2)
example/dfetch.yaml (1)

47-48: The destination path still looks like a cpputest copy/paste.

name and repo-path point to cppcheck, but dst says Tests/cpputest-archive. Renaming it would make the example self-consistent.

Suggested change
-    dst: Tests/cpputest-archive
+    dst: Tests/cppcheck-archive
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@example/dfetch.yaml` around lines 47 - 48, The dst value is a leftover
cpputest copy/paste while repo-path (and name) reference cppcheck; update the
dst field to match the repo (e.g., change dst: Tests/cpputest-archive to dst:
Tests/cppcheck-archive or another cppcheck-consistent path) so the dfetch entry
is self-consistent with repo-path/name.
tests/test_update.py (1)

77-80: This assertion no longer exercises the callback contract.

ANY only proves that some callable was passed. It won’t catch regressions where the closure uses the wrong destination or stops delegating to superproject.ignored_files(). Consider extracting call_args.kwargs["ignored_files_callback"], invoking it, and asserting the resulting ignored_files(...) call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_update.py` around lines 77 - 80, The test currently uses ANY for
ignored_files_callback which only verifies a callable was passed; instead, grab
the actual callback from
mocked_create.return_value.update.call_args.kwargs["ignored_files_callback"],
invoke that callback with the same parameters the production code would pass,
and assert it delegates to superproject.ignored_files(...) (e.g., by checking
the superproject mock received the expected call). Update the assertion block
around mocked_create.return_value.update to extract call_args, call the
ignored_files_callback, and assert the superproject.ignored_files mock was
called with the expected arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/util/util.py`:
- Around line 182-205: The change to hash_directory() will change computed
hashes and break SubProject._are_there_local_changes() comparisons against
existing .dfetch_data.yaml entries; to fix, add a stable versioning prefix to
the directory hash (e.g. "v1:" / "v2:") returned by hash_directory() and update
SubProject._are_there_local_changes() to detect older unprefixed hashes and
either compute a legacy hash (call the previous hash logic as a fallback) or
accept both formats while writing new metadata always with the versioned format;
also add a small migration path that, when an old-format hash is encountered,
updates .dfetch_data.yaml to the new versioned hash after successful
verification so users aren’t forced to refetch.

---

Duplicate comments:
In `@dfetch/project/subproject.py`:
- Around line 418-429: The early return that treats the on-disk manifest as
identical should respect the version's revision_is_enough() policy: when you
compare on_disk_version and project.version (the block using on_disk_version =
self.on_disk_version() and the equality check on tag/revision), only return None
if the versions match AND the version's revision_is_enough() indicates the
revision alone is sufficient; otherwise fall through so the manifest can be
rewritten. Update the condition to call revision_is_enough() on the appropriate
Version object (e.g., on_disk_version.revision_is_enough() or
project.version.revision_is_enough()) and require it true before skipping the
update.

In `@dfetch/util/util.py`:
- Around line 57-61: The code compares resolved_src_root and resolved_src_path
using os.path.commonpath which on Windows can raise ValueError for paths on
different drives, bypassing the intended RuntimeError; wrap the commonpath check
in a try/except ValueError (or handle exceptions from os.path.commonpath) inside
the function where resolved_src_root/resolved_src_path are computed and, if a
ValueError occurs or the commonpath check fails, raise the same
RuntimeError(f"src {src!r} escapes the source root") so drive-mismatch cases are
normalized into the existing RuntimeError path (locate the block that computes
resolved_src_root, src_path, resolved_src_path and the if
os.path.commonpath(...) check).

---

Nitpick comments:
In `@example/dfetch.yaml`:
- Around line 47-48: The dst value is a leftover cpputest copy/paste while
repo-path (and name) reference cppcheck; update the dst field to match the repo
(e.g., change dst: Tests/cpputest-archive to dst: Tests/cppcheck-archive or
another cppcheck-consistent path) so the dfetch entry is self-consistent with
repo-path/name.

In `@tests/test_update.py`:
- Around line 77-80: The test currently uses ANY for ignored_files_callback
which only verifies a callable was passed; instead, grab the actual callback
from
mocked_create.return_value.update.call_args.kwargs["ignored_files_callback"],
invoke that callback with the same parameters the production code would pass,
and assert it delegates to superproject.ignored_files(...) (e.g., by checking
the superproject mock received the expected call). Update the assertion block
around mocked_create.return_value.update to extract call_args, call the
ignored_files_callback, and assert the superproject.ignored_files mock was
called with the expected arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 91c98156-c875-4b97-8b43-9f4a7d812b95

📥 Commits

Reviewing files that changed from the base of the PR and between 2511c76 and 3735bc4.

📒 Files selected for processing (11)
  • dfetch/commands/update.py
  • dfetch/commands/update_patch.py
  • dfetch/project/subproject.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • example/dfetch.yaml
  • features/check-archive.feature
  • features/report-sbom.feature
  • tests/test_subproject.py
  • tests/test_update.py
  • tests/test_util.py
✅ Files skipped from review due to trivial changes (2)
  • tests/test_util.py
  • dfetch/vcs/archive.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • features/report-sbom.feature
  • dfetch/commands/update.py
  • tests/test_subproject.py
  • features/check-archive.feature

@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Mar 21, 2026

@coderabbitai resume

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 21, 2026

✅ Actions performed

Reviews resumed.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
dfetch/vcs/archive.py (1)

316-357: ⚠️ Potential issue | 🔴 Critical

Reject TAR links and special members before the bare extractall() branch.

_check_tar_members() only validates member.name. On the fallback path that still calls unfiltered tf.extractall(dest_dir), symlinks, hardlinks, FIFOs, and device nodes can still be materialized from an untrusted archive. If we keep this compatibility path, those member types need to be blocked (or fully validated) before extraction.

🔒 Proposed hardening
     for member in members:
         ArchiveLocalRepo._check_archive_member_path(member.name)
+        if (
+            member.issym()
+            or member.islnk()
+            or member.ischr()
+            or member.isblk()
+            or member.isfifo()
+        ):
+            raise RuntimeError(
+                f"Archive contains an unsupported special member: {member.name!r}"
+            )

This verifies that the repo still supports interpreters that can take the unfiltered branch and that the current guard still lacks special-member checks. Expected result: the supported Python range includes the fallback path, and the snippet shows no issym()/islnk()/ischr()/isblk()/isfifo() checks before extractall().

#!/bin/bash
set -eu
echo "=== Supported Python versions ==="
rg -n 'requires-python|python_requires' pyproject.toml setup.cfg setup.py 2>/dev/null || true

echo
echo "=== Current TAR validation/extraction path ==="
sed -n '316,357p' dfetch/vcs/archive.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 316 - 357, _check_tar_members currently
only validates member.name, but when _extract_raw takes the unfiltered fallback
path (tf.extractall without filter="tar") special members like symlinks,
hardlinks, character/block devices and FIFOs can be materialized; update
ArchiveLocalRepo._check_tar_members (called from _extract_raw) to also reject
members where member.issym(), member.islnk(), member.ischr(), member.isblk(), or
member.isfifo() and raise a RuntimeError with a clear message so the unfiltered
extraction path is safe on older Python versions; keep the existing
ArchiveLocalRepo._check_archive_limits and member.name checks and ensure the new
checks run before tf.extractall in _extract_raw.
dfetch/project/archivesubproject.py (1)

112-125: ⚠️ Potential issue | 🟠 Major

Hash the on-disk archive URL when freezing URL-pinned entries.

When on_disk.revision is a plain URL, freeze_project() falls back to _download_and_compute_hash("sha256"), but that helper is hard-wired to self.remote. If the manifest URL changes after the last fetch, freeze will pin the hash of the new URL instead of the archive recorded in metadata.

🧭 Proposed fix
-    def _download_and_compute_hash(self, algorithm: str = "sha256") -> IntegrityHash:
+    def _download_and_compute_hash(
+        self, url: str, algorithm: str = "sha256"
+    ) -> IntegrityHash:
         """Download the archive to a temporary file and return its :class:`IntegrityHash`."""
-        fd, tmp_path = tempfile.mkstemp(suffix=_suffix_for_url(self.remote))
+        fd, tmp_path = tempfile.mkstemp(suffix=_suffix_for_url(url))
         os.close(fd)
         try:
-            hex_digest = self._remote_repo.download(tmp_path, algorithm=algorithm)
+            hex_digest = ArchiveRemote(url).download(tmp_path, algorithm=algorithm)
             return IntegrityHash(algorithm, hex_digest)
@@
-        pinned = IntegrityHash.parse(revision) or self._download_and_compute_hash(
-            "sha256"
-        )
+        pinned = IntegrityHash.parse(revision) or self._download_and_compute_hash(
+            revision, "sha256"
+        )

Also applies to: 231-236

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/project/archivesubproject.py` around lines 112 - 125, The helper
_download_and_compute_hash currently always downloads from self.remote, causing
freeze_project to hash the current manifest URL instead of the archive URL
recorded in metadata; change _download_and_compute_hash to accept an explicit
url parameter (e.g., url: str = None) and use that URL when provided (fall back
to self.remote otherwise), update calls from freeze_project and the other
occurrences (the calls around lines 231-236) to pass the on-disk/archive URL
(the stored metadata URL) so the hash is computed against the original archived
URL rather than the possibly changed self.remote.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/vcs/archive.py`:
- Around line 94-96: The code uses urllib.parse.urlparse(self.url) and directly
passes parsed.path to os.path.exists and open, which fails for percent-encoded
file URLs; decode the URL path by applying urllib.parse.unquote(parsed.path) and
then convert to a platform pathname with urllib.request.url2pathname(...), and
use that resulting path for filesystem calls (replace usages of parsed.path in
the existence check and in open/read operations); update the checks around
parsed = urllib.parse.urlparse(self.url) and any subsequent references (e.g.,
where os.path.exists(parsed.path) and open(parsed.path, "rb") are called) to use
the decoded path variable instead.

---

Duplicate comments:
In `@dfetch/project/archivesubproject.py`:
- Around line 112-125: The helper _download_and_compute_hash currently always
downloads from self.remote, causing freeze_project to hash the current manifest
URL instead of the archive URL recorded in metadata; change
_download_and_compute_hash to accept an explicit url parameter (e.g., url: str =
None) and use that URL when provided (fall back to self.remote otherwise),
update calls from freeze_project and the other occurrences (the calls around
lines 231-236) to pass the on-disk/archive URL (the stored metadata URL) so the
hash is computed against the original archived URL rather than the possibly
changed self.remote.

In `@dfetch/vcs/archive.py`:
- Around line 316-357: _check_tar_members currently only validates member.name,
but when _extract_raw takes the unfiltered fallback path (tf.extractall without
filter="tar") special members like symlinks, hardlinks, character/block devices
and FIFOs can be materialized; update ArchiveLocalRepo._check_tar_members
(called from _extract_raw) to also reject members where member.issym(),
member.islnk(), member.ischr(), member.isblk(), or member.isfifo() and raise a
RuntimeError with a clear message so the unfiltered extraction path is safe on
older Python versions; keep the existing ArchiveLocalRepo._check_archive_limits
and member.name checks and ensure the new checks run before tf.extractall in
_extract_raw.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cf232bd8-d2fc-451a-9e1d-d638564f930f

📥 Commits

Reviewing files that changed from the base of the PR and between 2511c76 and 90a1fee.

📒 Files selected for processing (26)
  • CHANGELOG.rst
  • dfetch/commands/format_patch.py
  • dfetch/commands/update.py
  • dfetch/commands/update_patch.py
  • dfetch/manifest/parse.py
  • dfetch/manifest/schema.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/gitsuperproject.py
  • dfetch/project/subproject.py
  • dfetch/project/superproject.py
  • dfetch/project/svnsuperproject.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • doc/manifest.rst
  • example/dfetch.yaml
  • features/check-archive.feature
  • features/fetch-archive.feature
  • features/report-sbom.feature
  • features/steps/archive_steps.py
  • features/steps/generic_steps.py
  • features/steps/manifest_steps.py
  • features/validate-manifest.feature
  • tests/test_patch.py
  • tests/test_subproject.py
  • tests/test_update.py
  • tests/test_util.py
✅ Files skipped from review due to trivial changes (5)
  • CHANGELOG.rst
  • tests/test_util.py
  • features/check-archive.feature
  • tests/test_subproject.py
  • example/dfetch.yaml
🚧 Files skipped from review as they are similar to previous changes (11)
  • tests/test_patch.py
  • tests/test_update.py
  • features/validate-manifest.feature
  • dfetch/commands/update_patch.py
  • features/steps/manifest_steps.py
  • features/steps/generic_steps.py
  • dfetch/util/util.py
  • features/report-sbom.feature
  • doc/manifest.rst
  • features/steps/archive_steps.py
  • dfetch/manifest/schema.py

@spoorcc spoorcc force-pushed the claude/implement-issue-23-fmipS branch from d017798 to 4aa1901 Compare March 22, 2026 11:59
…fixes

Security improvements:
- archive.py: Add `_check_tar_member_type()` to validate TAR members on ALL
  Python versions (not just 3.11.4+). Rejects symlinks with absolute/escaping
  targets, hardlinks with path-traversal targets, device files (chr/blk), and
  FIFOs. The `filter="tar"` on Python ≥ 3.11.4 now acts as defence-in-depth
  rather than the sole guard.

Bug fixes:
- log.py: Escape user-supplied strings (project names, info messages) with
  `rich.markup.escape()` before embedding them in Rich markup templates.
  Prevents `rich.errors.MarkupError` when project or remote names contain
  Rich markup syntax (e.g. `[/`). Fixes pre-existing crash in fuzzing tests.

Platform / correctness fixes:
- test_archive.py: Use `pathlib.Path.as_uri()` for RFC-compliant file URIs
  instead of manual string manipulation that is not portable on Windows.
- generic_steps.py: Use `pathlib.Path.as_posix()` in `remote_server_path()`
  instead of `os.sep`-based split/join, which could fail with mixed path
  separators on Windows.

Tests:
- Add 12 new unit tests for `_check_tar_member_type()` covering safe
  symlinks, absolute symlinks, `..` symlinks, safe/escaping hardlinks,
  char/block device files, FIFOs, and integration with `_check_tar_members`.

https://claude.ai/code/session_0169NNA4kGKpDkZVdLQE7VXn
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
dfetch/vcs/archive.py (1)

94-96: ⚠️ Potential issue | 🟠 Major

Decode file:// URIs before filesystem access.

urlparse(...).path is still percent-encoded and stays in URI form on Windows (/C:/..., UNC hosts, etc.), so valid Path(...).as_uri() inputs can look missing or fail to open. That breaks both is_accessible() and the file-download branch. A small helper that combines parsed.netloc + parsed.path, then runs urllib.parse.unquote() and urllib.request.url2pathname(), will fix both call sites.

Also applies to: 141-150

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 94 - 96, The file:// handling uses
parsed.path directly which can be percent-encoded and Windows/UNC formatted; add
a helper (e.g., _file_uri_to_path or decode_file_uri) that takes
urllib.parse.urlparse(self.url), joins parsed.netloc + parsed.path, runs
urllib.parse.unquote() then urllib.request.url2pathname() to produce a proper
filesystem path, and use that helper in the is_accessible code path (where
parsed.scheme == "file") and in the file-download branch (the code around the
later file-download block) so both existence checks and file opens use the
decoded OS path.
🧹 Nitpick comments (5)
features/steps/archive_steps.py (2)

25-33: Set deterministic metadata on TarInfo for reproducible archives.

The TarInfo object defaults to the current timestamp and system uid/gid, which means archives created at different times will have different hashes. If deterministic hashing is required (as mentioned in PR objectives), set fixed values for mtime, uid, gid, uname, and gname.

♻️ Suggested fix for deterministic tar.gz
 def create_tar_gz(archive_path: str, name: str, files: list[dict]) -> None:
     """Create a .tar.gz archive with files nested under a top-level <name>/ directory."""
     with tarfile.open(archive_path, "w:gz") as tar:
         for file in files:
             content = f"Generated file {file['path']}\n".encode()
             member_path = f"{name}/{file['path']}"
             info = tarfile.TarInfo(name=member_path)
             info.size = len(content)
+            info.mtime = 0  # Epoch for reproducibility
+            info.uid = 0
+            info.gid = 0
+            info.uname = ""
+            info.gname = ""
             tar.addfile(info, io.BytesIO(content))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/archive_steps.py` around lines 25 - 33, The TarInfo objects in
create_tar_gz are using system defaults (timestamp, uid/gid, names) causing
nondeterministic archives; update the create_tar_gz function so after creating
info = tarfile.TarInfo(name=member_path) you set deterministic metadata (e.g.,
info.mtime = 0 or a fixed epoch, info.uid = 0, info.gid = 0, info.uname = "",
info.gname = "") and ensure info.mode is a stable value; this will produce
reproducible .tar.gz outputs across runs.

36-42: Consider setting deterministic timestamps for zip entries.

Similar to the tar.gz case, writestr uses the current time by default. For deterministic archives, use ZipInfo with a fixed date_time.

♻️ Suggested fix for deterministic zip
 def create_zip(archive_path: str, name: str, files: list[dict]) -> None:
     """Create a .zip archive with files nested under a top-level <name>/ directory."""
     with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
         for file in files:
             content = f"Generated file {file['path']}\n"
             member_path = f"{name}/{file['path']}"
-            zf.writestr(member_path, content)
+            zip_info = zipfile.ZipInfo(member_path, date_time=(1980, 1, 1, 0, 0, 0))
+            zf.writestr(zip_info, content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/archive_steps.py` around lines 36 - 42, The archive entries
created in create_zip use writestr which embeds the current timestamp; make
entries deterministic by creating a ZipInfo for each member_path and set its
date_time to a fixed tuple (e.g., (1980, 1, 1, 0, 0, 0)) before calling writestr
with that ZipInfo and the content; update create_zip to build ZipInfo per file
(using member_path) and pass it to zipfile.ZipFile.writestr so all zip entries
have a stable timestamp.
features/steps/generic_steps.py (1)

111-134: Greedy list matching may produce counterintuitive results in edge cases.

The list subset matching is order-dependent and greedy: if an early exp_item matches multiple actual items, it claims the first match, potentially blocking a later exp_item that could only match that specific item.

For example, with expected = [{"a": 1}, {"a": 1, "b": 2}] and actual = [{"a": 1, "b": 2}], the first expected item greedily matches the only actual item, leaving the second expected item unmatched—even though a smarter algorithm would match the more specific item first.

This is acceptable for test code where inputs are controlled, but worth documenting in the docstring.

📝 Suggested docstring clarification
 def _json_subset_matches(expected, actual) -> bool:
-    """Return *True* when *expected* is a subset of *actual* (recursive)."""
+    """Return *True* when *expected* is a subset of *actual* (recursive).
+
+    For lists, each item in *expected* must match a distinct item in *actual*
+    using greedy first-match semantics. This may produce false negatives if
+    multiple expected items could match the same actual item.
+    """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/generic_steps.py` around lines 111 - 134, The
_json_subset_matches function's list-matching is order-dependent and uses a
greedy algorithm that can produce counterintuitive results (an earlier expected
item can steal the only matching actual item needed by a later, more specific
expected item); update the function docstring to explicitly document this
limitation: state that list matching is greedy and order-sensitive, give a
concise example (e.g., expected = [{"a":1}, {"a":1,"b":2}] vs actual =
[{"a":1,"b":2}]) and note that consumers should not rely on non-deterministic
matching or should pre-order expected lists to avoid this behavior; reference
the function name _json_subset_matches so reviewers can find it.
features/steps/manifest_steps.py (1)

15-28: Consider consolidating placeholder substitution logic.

The archive placeholder substitution (<archive-sha256>, <archive-sha384>, <archive-sha512>, <archive-url>) is duplicated in _apply_context_substitutions in generic_steps.py. Consider extracting a shared helper (e.g., in generic_steps.py) that handles the common archive placeholders, then have apply_manifest_substitutions call it and add the manifest-specific URL replacement.

♻️ Suggested refactor to reduce duplication

In generic_steps.py, export _apply_context_substitutions (rename to apply_archive_substitutions), then in this file:

-def apply_manifest_substitutions(context, contents: str) -> str:
-    """Apply context-specific substitutions to manifest contents."""
-    result = contents.replace(
-        "url: some-remote-server", f"url: file:///{remote_server_path(context)}"
-    )
-    if hasattr(context, "archive_sha256"):
-        result = result.replace("<archive-sha256>", context.archive_sha256)
-    if hasattr(context, "archive_sha384"):
-        result = result.replace("<archive-sha384>", context.archive_sha384)
-    if hasattr(context, "archive_sha512"):
-        result = result.replace("<archive-sha512>", context.archive_sha512)
-    if hasattr(context, "archive_url"):
-        result = result.replace("<archive-url>", context.archive_url)
-    return result
+from features.steps.generic_steps import apply_archive_substitutions
+
+def apply_manifest_substitutions(context, contents: str) -> str:
+    """Apply context-specific substitutions to manifest contents."""
+    result = contents.replace(
+        "url: some-remote-server", f"url: file:///{remote_server_path(context)}"
+    )
+    return apply_archive_substitutions(result, context)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@features/steps/manifest_steps.py` around lines 15 - 28, The archive
placeholder replacements in apply_manifest_substitutions duplicate logic in
_apply_context_substitutions; refactor by extracting the archive-related
replacement logic into a shared helper (rename/export
_apply_context_substitutions as apply_archive_substitutions in generic_steps.py)
and then have apply_manifest_substitutions call that helper to perform the
<archive-*> substitutions before applying its manifest-specific "url:
some-remote-server" → file:///... replacement; update imports/usage so
apply_manifest_substitutions uses apply_archive_substitutions(context, contents)
and only performs the manifest URL substitution itself.
example/dfetch.yaml (1)

45-53: Consider adding explicit vcs: archive for consistency.

The example demonstrates archive integrity verification, but vcs: archive is not explicitly specified. While dfetch may infer the VCS type from the archive format, all documented examples and test cases in the codebase explicitly specify vcs: archive. Adding it improves clarity and consistency.

💡 Suggested improvement
  - name: cppcheck-archive
    remote: github
    dst: Tests/cppcheck-archive
    repo-path: danmar/cppcheck/archive/2.20.0.tar.gz
+   vcs: archive
    ignore:
       - tests
       - .github
    integrity:
      hash: sha256:7be7992439339017edb551d8e7d2315f9bb57c402da50c2cee9cd0e2724600a1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@example/dfetch.yaml` around lines 45 - 53, Add an explicit "vcs: archive"
field to the dfetch.yaml entry named "cppcheck-archive" so the configuration
matches other examples and tests; update the block that currently has name:
cppcheck-archive, remote: github, dst: Tests/cppcheck-archive, repo-path:
danmar/cppcheck/archive/2.20.0.tar.gz, ignore: ... and integrity: ... by
inserting vcs: archive alongside the other keys to make the VCS type explicit
and consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch/util/util.py`:
- Around line 89-105: The pruning step currently dedupes by resolved target and
skips broken symlinks, causing matched symlink entries to survive; update the
loop over find_matching_files(directory, patterns) to deduplicate by the path
being removed (use the path string rather than .resolve()) and append the
original Path objects to paths, keep the reverse-length sort, and before calling
safe_rm(file_or_dir) check existence with os.path.lexists(str(file_or_dir))
instead of file_or_dir.exists(); reference functions/variables:
find_matching_files, is_license_file, safe_rm, and the seen/paths collections.

In `@dfetch/vcs/archive.py`:
- Around line 69-71: The URL classification in is_archive_url currently checks
the raw URL including query strings, causing URLs like
".../pkg.tar.gz?download=1" to not be detected; update is_archive_url to strip
query strings and fragments before testing extensions (e.g. parse the URL and
use the path component or split on '?'/'#'), then run lower().endswith(ext)
against ARCHIVE_EXTENSIONS so ArchiveSubProject.check() and other callers agree
on archive detection.

---

Duplicate comments:
In `@dfetch/vcs/archive.py`:
- Around line 94-96: The file:// handling uses parsed.path directly which can be
percent-encoded and Windows/UNC formatted; add a helper (e.g., _file_uri_to_path
or decode_file_uri) that takes urllib.parse.urlparse(self.url), joins
parsed.netloc + parsed.path, runs urllib.parse.unquote() then
urllib.request.url2pathname() to produce a proper filesystem path, and use that
helper in the is_accessible code path (where parsed.scheme == "file") and in the
file-download branch (the code around the later file-download block) so both
existence checks and file opens use the decoded OS path.

---

Nitpick comments:
In `@example/dfetch.yaml`:
- Around line 45-53: Add an explicit "vcs: archive" field to the dfetch.yaml
entry named "cppcheck-archive" so the configuration matches other examples and
tests; update the block that currently has name: cppcheck-archive, remote:
github, dst: Tests/cppcheck-archive, repo-path:
danmar/cppcheck/archive/2.20.0.tar.gz, ignore: ... and integrity: ... by
inserting vcs: archive alongside the other keys to make the VCS type explicit
and consistent.

In `@features/steps/archive_steps.py`:
- Around line 25-33: The TarInfo objects in create_tar_gz are using system
defaults (timestamp, uid/gid, names) causing nondeterministic archives; update
the create_tar_gz function so after creating info =
tarfile.TarInfo(name=member_path) you set deterministic metadata (e.g.,
info.mtime = 0 or a fixed epoch, info.uid = 0, info.gid = 0, info.uname = "",
info.gname = "") and ensure info.mode is a stable value; this will produce
reproducible .tar.gz outputs across runs.
- Around line 36-42: The archive entries created in create_zip use writestr
which embeds the current timestamp; make entries deterministic by creating a
ZipInfo for each member_path and set its date_time to a fixed tuple (e.g.,
(1980, 1, 1, 0, 0, 0)) before calling writestr with that ZipInfo and the
content; update create_zip to build ZipInfo per file (using member_path) and
pass it to zipfile.ZipFile.writestr so all zip entries have a stable timestamp.

In `@features/steps/generic_steps.py`:
- Around line 111-134: The _json_subset_matches function's list-matching is
order-dependent and uses a greedy algorithm that can produce counterintuitive
results (an earlier expected item can steal the only matching actual item needed
by a later, more specific expected item); update the function docstring to
explicitly document this limitation: state that list matching is greedy and
order-sensitive, give a concise example (e.g., expected = [{"a":1},
{"a":1,"b":2}] vs actual = [{"a":1,"b":2}]) and note that consumers should not
rely on non-deterministic matching or should pre-order expected lists to avoid
this behavior; reference the function name _json_subset_matches so reviewers can
find it.

In `@features/steps/manifest_steps.py`:
- Around line 15-28: The archive placeholder replacements in
apply_manifest_substitutions duplicate logic in _apply_context_substitutions;
refactor by extracting the archive-related replacement logic into a shared
helper (rename/export _apply_context_substitutions as
apply_archive_substitutions in generic_steps.py) and then have
apply_manifest_substitutions call that helper to perform the <archive-*>
substitutions before applying its manifest-specific "url: some-remote-server" →
file:///... replacement; update imports/usage so apply_manifest_substitutions
uses apply_archive_substitutions(context, contents) and only performs the
manifest URL substitution itself.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a469c40-3eb8-450c-b3d3-47cfc5d2c330

📥 Commits

Reviewing files that changed from the base of the PR and between cd1f807 and a2a464c.

📒 Files selected for processing (45)
  • .github/workflows/release.yml
  • CHANGELOG.rst
  • dfetch/commands/check.py
  • dfetch/commands/format_patch.py
  • dfetch/commands/freeze.py
  • dfetch/commands/report.py
  • dfetch/commands/update.py
  • dfetch/commands/update_patch.py
  • dfetch/log.py
  • dfetch/manifest/manifest.py
  • dfetch/manifest/parse.py
  • dfetch/manifest/project.py
  • dfetch/manifest/schema.py
  • dfetch/project/__init__.py
  • dfetch/project/archivesubproject.py
  • dfetch/project/gitsubproject.py
  • dfetch/project/gitsuperproject.py
  • dfetch/project/subproject.py
  • dfetch/project/superproject.py
  • dfetch/project/svnsubproject.py
  • dfetch/project/svnsuperproject.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • dfetch/vcs/integrity_hash.py
  • doc/manifest.rst
  • example/dfetch.yaml
  • features/check-archive.feature
  • features/fetch-archive.feature
  • features/freeze-archive.feature
  • features/report-sbom-archive.feature
  • features/report-sbom.feature
  • features/steps/archive_steps.py
  • features/steps/generic_steps.py
  • features/steps/manifest_steps.py
  • features/validate-manifest.feature
  • tests/test_archive.py
  • tests/test_integrity.py
  • tests/test_integrity_hash.py
  • tests/test_patch.py
  • tests/test_purl.py
  • tests/test_subproject.py
  • tests/test_update.py
  • tests/test_util.py
✅ Files skipped from review due to trivial changes (11)
  • .github/workflows/release.yml
  • tests/test_patch.py
  • dfetch/project/svnsubproject.py
  • features/freeze-archive.feature
  • CHANGELOG.rst
  • features/report-sbom-archive.feature
  • features/check-archive.feature
  • tests/test_integrity.py
  • features/fetch-archive.feature
  • tests/test_integrity_hash.py
  • dfetch/reporting/sbom_reporter.py
🚧 Files skipped from review as they are similar to previous changes (16)
  • dfetch/project/gitsuperproject.py
  • dfetch/commands/check.py
  • dfetch/project/svnsuperproject.py
  • tests/test_update.py
  • dfetch/commands/update.py
  • dfetch/project/init.py
  • dfetch/project/superproject.py
  • features/report-sbom.feature
  • dfetch/manifest/schema.py
  • tests/test_purl.py
  • dfetch/manifest/manifest.py
  • dfetch/project/gitsubproject.py
  • tests/test_util.py
  • dfetch/commands/format_patch.py
  • features/validate-manifest.feature
  • dfetch/manifest/parse.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dfetch/log.py (1)

146-152: ⚠️ Potential issue | 🟡 Minor

Printf-style record.args not escaped.

The filter escapes record.msg but leaves record.args intact. When the logging framework later formats the message (e.g., "Patching: %s" % (filename,)), the substituted values bypass escaping. External loggers like patch_ng may log filenames that contain brackets (e.g., file[1].txt), which would then be misinterpreted as Rich markup.

Consider formatting the message first, then escaping the result:

Proposed fix
     def filter(self, record: logging.LogRecord) -> bool:
         """Add indentation to the log record message."""
         color = "blue" if record.levelno < logging.WARNING else "yellow"
 
-        line = markup_escape(str(record.msg)).replace("\n", "\n    ")
+        # Format first so args are substituted, then escape the complete message
+        try:
+            formatted = str(record.msg) % record.args if record.args else str(record.msg)
+        except (TypeError, ValueError):
+            formatted = str(record.msg)
+        line = markup_escape(formatted).replace("\n", "\n    ")
         record.msg = f"{self.prefix}[{color}]{line}[/{color}]"
+        record.args = ()
         return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/log.py` around lines 146 - 152, The filter method currently escapes
record.msg but not printf-style record.args, so format the final message first
(use record.getMessage() or equivalent to apply record.args), then escape and
indent that formatted string with markup_escape and newline replacement; assign
the resulting prefixed/colored string back to record.msg and clear record.args
(e.g., set to empty tuple) so the logging framework won't re-format it later;
keep the existing color logic and prefix in the new record.msg assignment.
♻️ Duplicate comments (1)
dfetch/vcs/archive.py (1)

130-132: ⚠️ Potential issue | 🟡 Minor

Decode percent-encoded file:// paths before filesystem operations.

urllib.parse.urlparse() does not percent-decode the path component. A valid file:// URL with spaces (encoded as %20) or other special characters will cause os.path.exists(parsed.path) to return False because the path still contains %20 instead of a literal space.

Use urllib.parse.unquote() followed by urllib.request.url2pathname() to convert the URL path to a proper filesystem path:

Proposed fix
+from urllib.request import url2pathname
+
 def is_accessible(self) -> bool:
     ...
     parsed = urllib.parse.urlparse(self.url)
     if parsed.scheme == "file":
-        return os.path.exists(parsed.path)
+        local_path = url2pathname(urllib.parse.unquote(parsed.path))
+        return os.path.exists(local_path)

The same fix is needed in download() at lines 178-190 where parsed.path is used for open() and shutil.copy().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 130 - 132, The file:// branch in the
Archive class is using parsed.path directly which can contain percent-encoded
characters; update the checks and file operations to first convert the URL path
to a platform file path by calling urllib.parse.unquote(parsed.path) and then
urllib.request.url2pathname(...) before passing to os.path.exists, open(), and
shutil.copy(); apply this change in the method that contains the parsed =
urllib.parse.urlparse(self.url) check (the file-scheme existence check) and
likewise in the download() method where parsed.path is used for open/shutil.copy
so all filesystem operations use the decoded/url2pathname path.
🧹 Nitpick comments (1)
dfetch/reporting/sbom_reporter.py (1)

319-331: Inconsistent handling of empty namespace between archive and VCS refs.

Line 301 uses purl.namespace or None to avoid setting an empty string, but line 322 assigns purl.namespace directly. When vcs_url_to_purl() returns an empty namespace (which it can for generic types), component.group will be set to "" instead of None.

For consistency and cleaner SBOM output, consider applying the same pattern:

Proposed fix
     `@staticmethod`
     def _apply_vcs_refs(component: Component, purl: PackageURL) -> None:
         """Add VCS external reference and group for a generic VCS dependency."""
-        component.group = purl.namespace
+        component.group = purl.namespace or None  # type: ignore[assignment]
         vcs_url = purl.qualifiers.get("vcs_url", "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/reporting/sbom_reporter.py` around lines 319 - 331, The
component.group assignment in _apply_vcs_refs currently sets component.group =
purl.namespace which can leave an empty string; change it to use the same
empty-to-None pattern as elsewhere (e.g., component.group = purl.namespace or
None) so empty namespaces become None, leaving the rest of the logic (vcs_url
lookup and ExternalReference creation using ExternalReferenceType.VCS and XsUri)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@dfetch/log.py`:
- Around line 146-152: The filter method currently escapes record.msg but not
printf-style record.args, so format the final message first (use
record.getMessage() or equivalent to apply record.args), then escape and indent
that formatted string with markup_escape and newline replacement; assign the
resulting prefixed/colored string back to record.msg and clear record.args
(e.g., set to empty tuple) so the logging framework won't re-format it later;
keep the existing color logic and prefix in the new record.msg assignment.

---

Duplicate comments:
In `@dfetch/vcs/archive.py`:
- Around line 130-132: The file:// branch in the Archive class is using
parsed.path directly which can contain percent-encoded characters; update the
checks and file operations to first convert the URL path to a platform file path
by calling urllib.parse.unquote(parsed.path) and then
urllib.request.url2pathname(...) before passing to os.path.exists, open(), and
shutil.copy(); apply this change in the method that contains the parsed =
urllib.parse.urlparse(self.url) check (the file-scheme existence check) and
likewise in the download() method where parsed.path is used for open/shutil.copy
so all filesystem operations use the decoded/url2pathname path.

---

Nitpick comments:
In `@dfetch/reporting/sbom_reporter.py`:
- Around line 319-331: The component.group assignment in _apply_vcs_refs
currently sets component.group = purl.namespace which can leave an empty string;
change it to use the same empty-to-None pattern as elsewhere (e.g.,
component.group = purl.namespace or None) so empty namespaces become None,
leaving the rest of the logic (vcs_url lookup and ExternalReference creation
using ExternalReferenceType.VCS and XsUri) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bd1424e-9465-4474-bd20-5fd241a5a084

📥 Commits

Reviewing files that changed from the base of the PR and between a2a464c and 501e3d9.

📒 Files selected for processing (9)
  • dfetch/log.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • doc/landing-page/conf.py
  • features/steps/generic_steps.py
  • features/steps/manifest_steps.py
  • tests/test_purl.py
✅ Files skipped from review due to trivial changes (1)
  • doc/landing-page/conf.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • features/steps/manifest_steps.py
  • tests/test_purl.py
  • dfetch/util/util.py
  • features/steps/generic_steps.py
  • dfetch/util/purl.py

@spoorcc
Copy link
Copy Markdown
Contributor Author

spoorcc commented Mar 22, 2026

@coderabbitai resume

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 22, 2026

✅ Actions performed

Reviews resumed.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
dfetch/vcs/archive.py (1)

416-418: Consider including symlink/hardlink sizes in the bomb check.

The total bytes calculation only sums m.size for m in members if m.isfile(), excluding symlinks and hardlinks. While links themselves are small, a malicious archive could contain many hardlinks to a large file, potentially bypassing the size check while still extracting significant data. This is a minor edge case given the member count limit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/vcs/archive.py` around lines 416 - 418, The archive size check
currently only sums m.size for members where m.isfile(); update the calculation
used before calling ArchiveLocalRepo._check_archive_limits to also include
symlinks and hardlinks by summing m.size for members where m.isfile() or
m.issym() or m.islnk() (i.e., include sizes from TarInfo.isfile(),
TarInfo.issym(), TarInfo.islnk()), keeping the same members list from
tf.getmembers() and passing the new total into _check_archive_limits so
symlink/hardlink sizes are accounted for in the bomb check.
dfetch/log.py (1)

199-217: The rebinding logic correctly handles patch_ng's actual pattern.

The code's check isinstance(attr, types.MethodType) and attr.__self__ is logger successfully handles patch_ng's documented pattern, which uses module-level bound method aliases (e.g., warning = logger.warning). The edge cases mentioned—unbound function references or wrapped methods—don't apply to patch_ng's actual implementation and would be difficult to handle generically. The "best-effort" approach is appropriate here. Adding debug logging when rebinding occurs would aid troubleshooting, though it's not required for correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch/log.py` around lines 199 - 217, The rebinding loop correctly handles
patch_ng but should emit a debug message when a rebind actually happens to aid
troubleshooting: inside the for loop that iterates method_name over
("debug","info","warning","error","critical") in the block that checks
isinstance(attr, types.MethodType) and attr.__self__ is logger, add a
logger.debug call (or logging.getLogger(name).debug) just before calling
setattr(module, method_name, getattr(logger, method_name)) that records the
module name, method_name, and the original attr (e.g., repr(attr)) so reviewers
can see which module-level bound methods were replaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@dfetch/log.py`:
- Around line 199-217: The rebinding loop correctly handles patch_ng but should
emit a debug message when a rebind actually happens to aid troubleshooting:
inside the for loop that iterates method_name over
("debug","info","warning","error","critical") in the block that checks
isinstance(attr, types.MethodType) and attr.__self__ is logger, add a
logger.debug call (or logging.getLogger(name).debug) just before calling
setattr(module, method_name, getattr(logger, method_name)) that records the
module name, method_name, and the original attr (e.g., repr(attr)) so reviewers
can see which module-level bound methods were replaced.

In `@dfetch/vcs/archive.py`:
- Around line 416-418: The archive size check currently only sums m.size for
members where m.isfile(); update the calculation used before calling
ArchiveLocalRepo._check_archive_limits to also include symlinks and hardlinks by
summing m.size for members where m.isfile() or m.issym() or m.islnk() (i.e.,
include sizes from TarInfo.isfile(), TarInfo.issym(), TarInfo.islnk()), keeping
the same members list from tf.getmembers() and passing the new total into
_check_archive_limits so symlink/hardlink sizes are accounted for in the bomb
check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 68963b9f-083a-4037-9d38-3eb61fd0e1dd

📥 Commits

Reviewing files that changed from the base of the PR and between a2a464c and b16855f.

📒 Files selected for processing (9)
  • dfetch/log.py
  • dfetch/reporting/sbom_reporter.py
  • dfetch/util/purl.py
  • dfetch/util/util.py
  • dfetch/vcs/archive.py
  • doc/landing-page/conf.py
  • features/steps/generic_steps.py
  • features/steps/manifest_steps.py
  • tests/test_purl.py
✅ Files skipped from review due to trivial changes (1)
  • doc/landing-page/conf.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_purl.py

@spoorcc spoorcc merged commit 8fa0797 into main Mar 22, 2026
35 checks passed
@spoorcc spoorcc deleted the claude/implement-issue-23-fmipS branch March 22, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants