feat: add --component-map path-prefix backend for package-database-free rootfs #111

Closed
hanthor wants to merge 1 commit into coreos:main from hanthor:feat/component-map

Conversation

hanthor commented Apr 14, 2026

Closes #110 (partial — see also #112).

Adds --component-map PATH and a pathmap backend: a JSON file of path-prefix → component rules, for rootfs images with no package DB and no xattrs.

Tested on Dakota (7.4 GiB, 197k files, 120 layers): 81% file coverage, 54 components, 22.1 s ±1.29 s.

All 58 tests pass, and `just clippy` is clean.

Assisted-by: Claude Sonnet 4.6

Add a new `pathmap` component repo that assigns files to components
based on path-prefix rules loaded from a JSON file. This is useful for
build systems like BuildStream (used by GNOME OS) that produce rootfs
images without a package database, making it impossible to use the RPM
or deb backends.

The file format is a JSON array of objects with three fields:
  - `prefix`    (string, required): absolute path prefix to match
  - `component` (string, required): component name to assign
  - `interval`  (string, optional): "daily", "weekly", or "monthly"
                  (default); controls the packing stability weight

Rules are evaluated in order; the first matching prefix wins. Claims
from this repo are weak, so xattr and package-database repos take
precedence when present.
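
The rule format and the first-match-wins evaluation described above can be sketched as follows. This is a minimal illustration and not the PR's code: it uses only the standard library, with `std::path::Path` standing in for the `Utf8Path` type used in the PR, and the names `Rule`, `Interval::parse`, and `first_match` are hypothetical. It assumes prefixes match whole path components, which is how `Path::starts_with` behaves.

```rust
use std::path::Path;

/// Update cadence of a component; `Monthly` is the documented default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Interval {
    Daily,
    Weekly,
    #[default]
    Monthly,
}

impl Interval {
    /// Parse the optional `interval` field; `None` falls back to the default.
    /// (Hypothetical helper; the PR deserializes the field from JSON instead.)
    pub fn parse(value: Option<&str>) -> Option<Self> {
        match value {
            None => Some(Self::default()),
            Some("daily") => Some(Self::Daily),
            Some("weekly") => Some(Self::Weekly),
            Some("monthly") => Some(Self::Monthly),
            Some(_) => None, // unknown value: the caller decides how to report it
        }
    }
}

/// One rule from the map file (hypothetical in-memory form).
pub struct Rule<'a> {
    pub prefix: &'a str,
    pub component: &'a str,
    pub interval: Interval,
}

/// Return the component of the first rule whose prefix matches `path`.
/// Rules are scanned in file order, so earlier rules shadow later ones.
pub fn first_match<'a>(rules: &[Rule<'a>], path: &Path) -> Option<&'a str> {
    rules
        .iter()
        .find(|rule| path.starts_with(rule.prefix))
        .map(|rule| rule.component)
}
```

Because the first match wins, a specific prefix such as `/usr/lib/firmware` has to be listed before a broader one such as `/usr/lib`. Under component-wise matching, a `/usr/lib` rule does not claim `/usr/lib64/libc.so.6`.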

The feature is exposed via a new `--component-map PATH` flag on the
`build` subcommand. The error message shown when no repo is found is
updated to mention this flag as a fallback option.

Stability is modelled as a Poisson process: P(no change in T days) =
e^(-T/period_days). This keeps all intervals strictly ordered and
non-zero regardless of the stability window size.
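
The stability formula above can be written as a small function. This is a sketch only: the period lengths of 1, 7, and 30 days are assumptions made for illustration (the commit message does not state the values used), and the names `stability`, `DAILY`, `WEEKLY`, and `MONTHLY` are hypothetical.

```rust
/// Probability that a component sees no change within `window_days`,
/// modelling updates as a Poisson process with mean period `period_days`:
/// P(no change in T days) = e^(-T / period_days).
pub fn stability(period_days: f64, window_days: f64) -> f64 {
    (-window_days / period_days).exp()
}

/// Assumed period lengths, in days, for the three interval names.
pub const DAILY: f64 = 1.0;
pub const WEEKLY: f64 = 7.0;
pub const MONTHLY: f64 = 30.0;
```

For any positive window the exponent is larger (closer to zero) for longer periods, so daily < weekly < monthly holds, and the exponential never reaches zero exactly, which is the ordering and non-zero property the commit message refers to.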

Assisted-by: Claude Sonnet 4.6

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a path-prefix component mapping feature, enabling file assignment to components via a JSON configuration when package databases are absent. It adds a --component-map CLI argument and implements PathmapRepo to handle prefix-based claims with stability weights. Reviewer feedback suggests optimizing memory and performance by replacing the pre-computed path-to-component HashMap with lazy rule evaluation in the claim pass, thereby avoiding redundant file iterations and reducing memory consumption for large filesystems.

Comment thread src/components/pathmap.rs
Comment on lines +87 to +95
    pub struct PathmapRepo {
        /// Component names, indexed by ComponentId.
        components: IndexSet<String>,
        /// Per-component metadata (stability, etc.).
        component_meta: Vec<ComponentMeta>,
        /// Pre-computed path → ComponentId map.
        path_to_component: HashMap<Utf8PathBuf, ComponentId>,
        default_mtime_clamp: u64,
    }

medium

The path_to_component map can lead to significant memory overhead for large rootfs images, as it clones and stores every matching path. It is more efficient to store the rules and evaluate them lazily during the claim pass. This avoids duplicating path strings that are already present in the FileMap.

Suggested change
    pub struct PathmapRepo {
        /// Component names, indexed by ComponentId.
        components: IndexSet<String>,
        /// Per-component metadata (stability, etc.).
        component_meta: Vec<ComponentMeta>,
        /// Path-prefix rules.
        rules: Vec<(Utf8PathBuf, ComponentId)>,
        default_mtime_clamp: u64,
    }

Comment thread src/components/pathmap.rs
Comment on lines +99 to +146
    pub fn load(map_path: &Utf8Path, files: &FileMap, default_mtime_clamp: u64) -> Result<Self> {
        let content = std::fs::read_to_string(map_path)
            .with_context(|| format!("reading path-map file {map_path}"))?;
        let entries: Vec<PathMapEntry> = serde_json::from_str(&content)
            .with_context(|| format!("parsing path-map file {map_path}"))?;

        let mut components: IndexSet<String> = IndexSet::new();
        let mut component_meta: Vec<ComponentMeta> = Vec::new();
        let mut path_to_component: HashMap<Utf8PathBuf, ComponentId> = HashMap::new();

        // Pre-intern all component names and their metadata so that we can look
        // them up by ComponentId during the main loop without re-computing the
        // interval → stability mapping for every file.
        let mut entry_ids: Vec<ComponentId> = Vec::with_capacity(entries.len());
        for entry in &entries {
            let stability = entry.interval.to_stability();
            let (idx, inserted) = components.insert_full(entry.component.clone());
            if inserted {
                component_meta.push(ComponentMeta { stability });
            }
            entry_ids.push(ComponentId(idx));
        }

        for file_path in files.keys() {
            for (entry, &comp_id) in entries.iter().zip(entry_ids.iter()) {
                let prefix = Utf8Path::new(&entry.prefix);
                if file_path.starts_with(prefix) {
                    path_to_component.insert(file_path.clone(), comp_id);
                    break; // first matching rule wins
                }
            }
        }

        tracing::debug!(
            path = %map_path,
            rules = entries.len(),
            components = components.len(),
            paths = path_to_component.len(),
            "loaded pathmap components"
        );

        Ok(Self {
            components,
            component_meta,
            path_to_component,
            default_mtime_clamp,
        })
    }

medium

The load function currently iterates over all files in the rootfs to pre-compute claims. This is redundant because claim_pass already iterates over files and calls weak_claims_for_path. By moving the matching logic to weak_claims_for_path, we avoid processing files that are already claimed by higher-priority repositories (like xattr or rpm).

Additionally, the suggested implementation below merges stability values for duplicate component names by taking the minimum (most conservative) value, and normalizes each prefix to an absolute path so that matching is consistent.

    pub fn load(map_path: &Utf8Path, _files: &FileMap, default_mtime_clamp: u64) -> Result<Self> {
        let content = std::fs::read_to_string(map_path)
            .with_context(|| format!("reading path-map file {map_path}"))?;
        let entries: Vec<PathMapEntry> = serde_json::from_str(&content)
            .with_context(|| format!("parsing path-map file {map_path}"))?;

        let mut components: IndexSet<String> = IndexSet::new();
        let mut component_meta: Vec<ComponentMeta> = Vec::new();
        let mut rules: Vec<(Utf8PathBuf, ComponentId)> = Vec::with_capacity(entries.len());

        for entry in entries {
            let stability = entry.interval.to_stability();
            let (idx, inserted) = components.insert_full(entry.component);
            if inserted {
                component_meta.push(ComponentMeta { stability });
            } else {
                // Merge stability: use the minimum (most conservative) value
                component_meta[idx].stability = component_meta[idx].stability.min(stability);
            }

            let mut prefix = entry.prefix;
            if !prefix.starts_with('/') {
                prefix.insert(0, '/');
            }
            rules.push((Utf8PathBuf::from(prefix), ComponentId(idx)));
        }

        tracing::debug!(
            path = %map_path,
            rules = rules.len(),
            components = components.len(),
            "loaded pathmap components"
        );

        Ok(Self {
            components,
            component_meta,
            rules,
            default_mtime_clamp,
        })
    }
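
The two extra behaviours in the suggestion above, prefix normalization and minimum-stability merging, can be shown in isolation. This is a standalone sketch with hypothetical helper names (`normalize_prefix`, `merge_stability`); it keys on component names in a plain `HashMap` rather than the `IndexSet` plus `Vec<ComponentMeta>` pair used in the PR, and it assumes stability is an `f64`.

```rust
use std::collections::HashMap;

/// Ensure a rule prefix is absolute by prepending `/` when it is missing.
pub fn normalize_prefix(prefix: &str) -> String {
    if prefix.starts_with('/') {
        prefix.to_string()
    } else {
        format!("/{prefix}")
    }
}

/// Collapse `(component, stability)` pairs so that each component keeps
/// the minimum (most conservative) stability seen across all of its rules.
pub fn merge_stability(entries: &[(&str, f64)]) -> HashMap<String, f64> {
    let mut merged: HashMap<String, f64> = HashMap::new();
    for &(name, stability) in entries {
        merged
            .entry(name.to_string())
            .and_modify(|s| *s = (*s).min(stability))
            .or_insert(stability);
    }
    merged
}
```

Taking the minimum means that a component listed once as monthly and once as daily ends up treated as the less stable of the two, regardless of the order in which the rules appear.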

Comment thread src/components/pathmap.rs
Comment on lines +158 to +169
    fn weak_claims_for_path(
        &self,
        _rootfs: &Dir,
        path: &Utf8Path,
        _file_info: &super::FileInfo,
    ) -> Result<Vec<ComponentId>> {
        Ok(self
            .path_to_component
            .get(path)
            .map(|id| vec![*id])
            .unwrap_or_default())
    }

medium

Evaluate rules lazily to avoid the memory overhead of a pre-computed path map. This also ensures that we only perform prefix matching for files that haven't been claimed by higher-priority repositories.

Suggested change
    fn weak_claims_for_path(
        &self,
        _rootfs: &Dir,
        path: &Utf8Path,
        _file_info: &super::FileInfo,
    ) -> Result<Vec<ComponentId>> {
        for (prefix, id) in &self.rules {
            if path.starts_with(prefix) {
                return Ok(vec![*id]);
            }
        }
        Ok(vec![])
    }
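
The trade-off the reviewer describes can be illustrated with a standalone comparison of the two lookup strategies. This is a sketch under simplifying assumptions: it uses `std::path` types and plain `usize` ids instead of the PR's `Utf8PathBuf` and `ComponentId`, and the names `Rules`, `lookup_lazy`, and `build_eager` are hypothetical.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Ordered `(prefix, component id)` rules; the first match wins.
pub type Rules = Vec<(PathBuf, usize)>;

/// Lazy lookup: scan the rules for a single path at claim time.
/// Storage is proportional to the number of rules.
pub fn lookup_lazy(rules: &Rules, path: &Path) -> Option<usize> {
    rules
        .iter()
        .find(|(prefix, _)| path.starts_with(prefix))
        .map(|(_, id)| *id)
}

/// Eager table: resolve every file up front and store one cloned path
/// per matching file. Storage is proportional to the number of files.
pub fn build_eager(rules: &Rules, files: &[PathBuf]) -> HashMap<PathBuf, usize> {
    files
        .iter()
        .filter_map(|f| lookup_lazy(rules, f).map(|id| (f.clone(), id)))
        .collect()
}
```

Both strategies return the same answer for any file that was present when the table was built. The eager table holds one owned path per matched file, while the lazy form holds only the rule list and pays a linear scan over the rules for each path it is asked about.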

hanthor commented Apr 14, 2026

Dakota component-map example (source): 127 rules covering locale-data, firmware, kernel, fonts, ghostty, wallpapers, docs, python3, and more.

hanthor commented Apr 14, 2026

Benchmark: Dakota (7.4 GiB, 197k files, 120 layers)

| | baseline | +component-map | +filemap (PR #112) |
|---|---|---|---|
| time | 16.2 s ±0.75 s | 22.1 s ±1.29 s | 17.5 s ±0.20 s |
| components | 0 | 54 | 713 |
| coverage | | 81% / 6.0 GiB | 99% / 7.3 GiB |
| unclaimed | 7.4 GiB | 274 MiB | 13 MiB |

Note: BST xattrs are stripped on OCI export — xattr repo returns 0 matches on exported images.

hanthor commented Apr 15, 2026

Closing in favour of #113. The path-prefix map approach works but covers only 81% of files and requires manually maintaining rules. Once #113 is resolved (libc fallback for xattr reads), an LD_PRELOAD sidecar can serve the full file→element map generated by bst artifact list-contents with zero chunkah code changes and 99% coverage.

hanthor closed this Apr 15, 2026

Development

Successfully merging this pull request may close these issues.

Support rechunking BuildStream images