From 258e11f0a2f02b6eacd5b3aa97ca3993ca3fe38c Mon Sep 17 00:00:00 2001 From: nscuro Date: Fri, 5 Jun 2026 15:20:54 +0200 Subject: [PATCH] Update package metadata resolution design doc Adds section about PURL normalization and how that relates to artifact granularity. Also adds cross-references to ADRs and other parts of the docs where appropriate. Signed-off-by: nscuro --- .../design/package-metadata-resolution.md | 72 ++++++++++++++++--- 1 file changed, 62 insertions(+), 10 deletions(-) diff --git a/docs/concepts/architecture/design/package-metadata-resolution.md b/docs/concepts/architecture/design/package-metadata-resolution.md index a6bee361..0965acc1 100644 --- a/docs/concepts/architecture/design/package-metadata-resolution.md +++ b/docs/concepts/architecture/design/package-metadata-resolution.md @@ -7,7 +7,7 @@ repositories. This includes latest available versions, artifact hashes, and publ Dependency-Track uses this data for latest version checks, component age policies, and integrity verification. A singleton [durable workflow](durable-execution.md) orchestrates resolution and -processes packages in controlled batches. ADR-015 documents the data model. +processes packages in controlled batches. [ADR-015] documents the data model. ## Responsible consumption of public infrastructure @@ -51,7 +51,7 @@ This separates package-level information (latest version) from artifact-level in without corresponding package metadata, and the orchestration logic respects the resulting write-order dependency. -Refer to ADR-015 for the full rationale. +Refer to [ADR-015] for the full rationale. ### Persistence @@ -60,6 +60,52 @@ A temporal guard (`WHERE "RESOLVED_AT" < EXCLUDED."RESOLVED_AT"`) prevents older from overwriting newer ones. Writes use PostgreSQL `UNNEST` to batch many rows per statement, reducing round trips. +## PURL normalization + +A raw component PURL is rarely the right key for fetching upstream metadata. PURLs carry +information that does not affect identity (such as repository URLs, checksums, or BOM-specific +qualifiers), and they often omit information that the ecosystem needs to fetch a specific +artifact. Each resolver factory implements `normalize` to map a PURL to the shape it +expects, returning `null` if the resolver does not support the type. + +PURL normalization serves two purposes: + +1. It selects which resolver handles a given PURL. The first factory whose `normalize` + returns a non-null result wins. +2. It shapes the PURL into the exact key that gets persisted in `PACKAGE_ARTIFACT_METADATA`. + +Artifact granularity is thus not controlled by the input PURL, but by the resolver. + +### Maven + +A Maven coordinate is not fully identified by group, artifact, and version. The same +GAV can produce more than one artifact, distinguished by the [`type` and `classifier` +qualifiers][purl-maven-qualifiers]: `type` selects the packaging (`jar`, `war`, `pom`, `aar`), +and `classifier` selects a variant (`sources`, `javadoc`, `linux-x86_64`). Each has its own file on disk and its +own hash. The Maven resolver normalizes by keeping `type` (defaulting to `jar`, matching the +Maven convention) and `classifier` if present, and dropping everything else. The result is +that `pkg:maven/com.acme/acme-lib@1.2.3` and `pkg:maven/com.acme/acme-lib@1.2.3?classifier=sources` +resolve to distinct `PACKAGE_ARTIFACT_METADATA` rows with distinct hashes. + +### PyPI + +A PyPI release is not a single file. A version typically ships one source distribution +(`.tar.gz`) plus a fan of built wheels, one per supported Python version, ABI, and platform +(`cp311-cp311-manylinux_2_17_x86_64.whl`, `cp312-cp312-macosx_11_0_arm64.whl`, and so on). +Each file has its own hash, and the PyPI JSON API returns hashes per file rather than per +release. Without a way to pick one, the resolver cannot define which hash to record. + +To handle this, the PyPI resolver requires a [`file_name` qualifier][purl-pypi-qualifiers] on +the input PURL and preserves it during normalization. SBOM tooling that emits PyPI components +needs to include `file_name` to let the resolver record hashes. PURLs without it are not +rejected outright, but the resolver cannot disambiguate which file applies. + +### Other ecosystems + +Most other ecosystems publish one artifact per version (npm tarballs, RubyGems gems, +crates), so their resolvers drop all qualifiers during normalization and key on +`type/namespace/name@version` alone. + ## Workflow Package metadata resolution is a singleton [durable workflow](durable-execution.md). @@ -142,14 +188,17 @@ empty results for these so they don't re-appear as candidates in later batches. One activity per resolver processes its assigned PURLs sequentially. Different resolver activities run concurrently. +Before processing, the activity loads existing `PACKAGE_ARTIFACT_METADATA` rows for the batch and +passes each PURL's prior result to the resolver as a hint. Resolvers that can short-circuit on it +(for example, Maven for stable versions) skip the corresponding upstream fetch. + For each PURL, the activity: -1. Checks if the resolver already produced a result within the last 5 minutes (idempotency guard for retries). -2. Normalizes the PURL via the resolver factory. -3. Looks up configured repositories for the PURL type, ordered by resolution priority. -4. Iterates repositories, respecting internal/external classification, invoking the resolver +1. Normalizes the PURL via the resolver factory. +2. Looks up configured repositories for the PURL type, ordered by resolution priority. +3. Iterates repositories, respecting internal/external classification, invoking the resolver until one succeeds or none remain. -5. Buffers results and flushes to the database in batches of 25. +4. Buffers results and flushes to the database in batches of 25. If a resolver signals a retryable error (for example, HTTP 429), the activity flushes any buffered results and propagates the error. The durable execution engine then retries the activity with backoff. @@ -196,7 +245,7 @@ concurrency interact. ## Resolver extension point -Resolvers are pluggable via the plugin system. The API surface consists of two interfaces: +Resolvers are pluggable via the [extensibility](extensibility.md) system. The API surface consists of two interfaces: * **`PackageMetadataResolverFactory`**: Creates resolver instances. Declares whether the resolver needs a repository, normalizes PURLs (returning `null` to signal non-support), and exposes the @@ -235,15 +284,18 @@ The [durable execution](durable-execution.md) engine handles crash recovery and transparently: * If a node crashes mid-resolution, the workflow resumes from the last completed step on restart. - Results already flushed to the database are not re-processed, because the 5-minute idempotency - window in the activity causes the activity to skip recently resolved PURLs. + PURLs the prior run flushed before the crash no longer match the candidate query, because + their `RESOLVED_AT` timestamps fall within the 24-hour freshness window. * When resolution fails for a specific resolver even after exhausting retries, the workflow catches the `ActivityFailureException`, logs the failure, and continues. The workflow persists results from other resolvers normally. * On graceful shutdown, the activity checks for thread interruption before each PURL, flushes buffered results, and propagates the interruption. +[ADR-015]: https://github.com/DependencyTrack/dependency-track/blob/main/docs/adr/015-package-metadata.md [maven-overconsumption]: https://www.sonatype.com/blog/beyond-ips-addressing-organizational-overconsumption-in-maven-central [maven-tragedy]: https://www.sonatype.com/blog/maven-central-and-the-tragedy-of-the-commons [open-not-costless]: https://www.sonatype.com/blog/open-is-not-costless-reclaiming-sustainable-infrastructure +[purl-maven-qualifiers]: https://github.com/package-url/purl-spec/blob/main/types-doc/maven-definition.md#qualifiers-definition +[purl-pypi-qualifiers]: https://github.com/package-url/purl-spec/blob/main/types-doc/pypi-definition.md#qualifiers-definition [rfc-conditional]: https://www.rfc-editor.org/rfc/rfc7232