Add file lookup and cache to Dependencies #534
Conversation
Reviewer's Guide

Optimizes audb.Dependencies by introducing a lazy dictionary-based file index and a centralized property cache that are invalidated on all dataframe mutations, and by using NumPy masks instead of pandas boolean indexing for several filtered views, plus a small adjustment to the benchmark pre-warm step.

Sequence diagram for Dependencies caching, index building, and invalidation:

```mermaid
sequenceDiagram
participant Client
participant Dependencies
%% File membership check using lazy index
Client->>Dependencies: __contains__(file)
alt _file_index is None
Dependencies->>Dependencies: _get_file_index()
Dependencies->>Dependencies: build dict from _df.index
end
Dependencies-->>Client: bool
%% Cached property access (example: media)
Client->>Dependencies: media()
alt cache hit
Dependencies->>Dependencies: read _cache[media]
Dependencies-->>Client: media list
else cache miss
Dependencies->>Dependencies: compute NumPy mask from _df[type]
Dependencies->>Dependencies: build media list from _df.index[mask]
Dependencies->>Dependencies: store in _cache[media]
Dependencies-->>Client: media list
end
%% Dataframe mutation and cache invalidation (example: _add_media)
Client->>Dependencies: _add_media(files, archive, format, checksum, bit_depth, channels, sampling_rate, type, version, removed)
Dependencies->>Dependencies: append to _df
Dependencies->>Dependencies: _invalidate_cache()
Dependencies->>Dependencies: clear _cache and reset _file_index to None
```

Class diagram for updated audb.Dependencies caching and indexing:

```mermaid
classDiagram
class Dependencies {
- pd.DataFrame _df
- dict[str,int] _file_index
- dict[str,object] _cache
- pa.Schema _schema
+ __call__() pd.DataFrame
+ __contains__(file: str) bool
+ __eq__(other: Dependencies) bool
+ __getitem__(file: str) list
+ __len__() int
+ __str__() str
+ archives() list[str]
+ attachments() list[str]
+ attachment_ids() list[str]
+ files() list[str]
+ media() list[str]
+ removed_media() list[str]
+ table_ids() list[str]
+ tables() list[str]
+ archive(file: str) str
+ removed(file: str) bool
+ load(path: str) void
- _get_file_index() dict[str,int]
- _invalidate_cache() void
- _add_attachment(file: str, archive: str, format: str, checksum: str, version: str) void
- _add_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
- _add_meta(file: str, archive: str, checksum: str, version: str) void
- _drop(files: Sequence[str]) void
- _remove(file: str) void
- _update_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
- _update_media_version(files: Sequence[str], version: str) void
}
```
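A condensed sketch of the pattern the diagrams describe: a lazily built dict index for membership tests and a property cache cleared on every mutation. This is an illustration only, not the actual audb source; class and attribute names follow the diagrams above, but method bodies are simplified.

```python
import pandas as pd


class Dependencies:
    def __init__(self) -> None:
        self._df = pd.DataFrame(columns=["type"])
        self._file_index: dict[str, int] | None = None  # built on first use
        self._cache: dict[str, object] = {}  # derived values, cleared on mutation

    def _get_file_index(self) -> dict[str, int]:
        # Build file -> row position lazily, only when first needed
        if self._file_index is None:
            self._file_index = {
                file: position for position, file in enumerate(self._df.index)
            }
        return self._file_index

    def __contains__(self, file: str) -> bool:
        # O(1) dict lookup instead of a DataFrame index lookup
        return file in self._get_file_index()

    @property
    def media(self) -> list[str]:
        # Cache hit returns the stored list; a miss computes a NumPy mask
        if "media" not in self._cache:
            mask = self._df["type"].to_numpy() == "media"
            self._cache["media"] = self._df.index[mask].tolist()
        return self._cache["media"]

    def _invalidate_cache(self) -> None:
        # Must run after every mutation of self._df
        self._cache.clear()
        self._file_index = None
```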
Force-pushed from 555858d to f36e28e
Hey - I've found 2 issues, and left some high level feedback:
- The new dictionary-based file index assumes unique index labels for files; if non-unique indexes are possible, consider enforcing or asserting uniqueness (or falling back to the original `.loc` behavior) to avoid subtle behavior changes in `__getitem__` compared to pandas (one possible guard is sketched after this list).
- In the benchmark, replacing the warm-up call from `deps.archives` to `deps._df` is slightly misleading and relies on a private attribute; consider either explicitly warming the relevant cached properties or updating the comment to reflect that you're intentionally avoiding cache warm-up.
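One possible shape for the uniqueness guard mentioned in the first point (a hypothetical sketch, not code from this PR): assert uniqueness once when the index is built, so duplicate labels fail loudly instead of silently diverging from pandas `.loc` semantics.

```python
def _get_file_index(self) -> dict[str, int]:
    # Hypothetical guard: a dict can hold only one position per label,
    # so refuse to build the index from a non-unique DataFrame index
    if self._file_index is None:
        if not self._df.index.is_unique:
            raise ValueError(
                "Cannot build file index: dependency table has duplicate files"
            )
        self._file_index = {
            file: position for position, file in enumerate(self._df.index)
        }
    return self._file_index
```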
## Individual Comments
### Comment 1
<location> `benchmarks/benchmark-dependencies-methods.py:97-101` </location>
<code_context>
# Further calls will be faster
"file-10.wav" in deps
-deps.archives
+deps._df
_ = deps.archive(_files[0])
</code_context>
<issue_to_address>
**suggestion:** The benchmark pre-warm step now bypasses the property logic and no longer exercises the cache.
Accessing `deps._df` here only touches the underlying DataFrame and skips the new caching/indexing logic. If the intent is to benchmark steady-state performance of the public API with caches warmed, consider invoking a real property like `deps.archives` (or a dedicated warm-up helper) instead of the private `_df` attribute.
```suggestion
# Pre-warm accesses
# Further calls will be faster
"file-10.wav" in deps
_ = deps.archives
_ = deps.archive(_files[0])
```
</issue_to_address>
### Comment 2
<location> `audb/core/dependencies.py:67` </location>
<code_context>
self._df = self._set_dtypes(self._df)
+ # Lazy dictionary index for O(1) file lookups
+ self._file_index: dict[str, int] | None = None
+ # Property cache (invalidated on modifications)
+ self._cache: dict[str, object] = {}
# pyarrow schema
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared caching and DataFrame-assignment logic into small helper methods to keep the new performance optimizations while reducing repetition and cache-management complexity.
You can keep the performance improvements while reducing the new complexity by:
### 1. Deduplicate the cache pattern
All properties now repeat the same “check key, compute, store, return” logic. A small helper removes boilerplate and makes the contract explicit:
```python
from collections.abc import Callable  # import needed for the type hint


def _get_cached(self, key: str, build: Callable[[], object]) -> object:
    try:
        return self._cache[key]
    except KeyError:
        value = build()
        self._cache[key] = value
        return value
```
Then properties become much simpler and harder to get wrong:
```python
@property
def archives(self) -> list[str]:
    return self._get_cached(
        "archives",
        lambda: sorted(self._df["archive"].unique().tolist()),
    )

@property
def attachments(self) -> list[str]:
    return self._get_cached(
        "attachments",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["attachment"]
        ].tolist(),
    )

@property
def media(self) -> list[str]:
    return self._get_cached(
        "media",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["media"]
        ].tolist(),
    )
```
This keeps your caching behavior but avoids repeated manual dict access and string-key bugs.
### 2. Centralize cache invalidation on `_df` mutation
Right now every mutator manually calls `_invalidate_cache()`. You can centralize this by routing `_df` assignments through a helper, reducing the chance of forgetting invalidation in future changes:
```python
def _set_df(self, df: pd.DataFrame) -> None:
    self._df = df
    self._invalidate_cache()
```
Then update mutation sites to use `_set_df` instead of bare `_df` assignment:
```python
def _add_media(self, values: Sequence[tuple[...]]):
    df = pd.DataFrame.from_records(
        values,
        columns=["file"] + list(define.DEPENDENCY_TABLE.keys()),
    ).set_index("file")
    df = self._set_dtypes(df)
    self._set_df(pd.concat([self._df, df]))

def _drop(self, files: Sequence[str]):
    self._set_df(self._df[~self._df.index.isin(files)])

def load(...):
    ...
    if extension == "csv":
        ...
        self._set_df(self._table_to_dataframe(table))
    elif extension == "parquet":
        ...
        self._set_df(self._table_to_dataframe(table))
```
For in-place updates that don’t replace `_df` (like `_remove`, `_update_media`, `_update_media_version`), you can keep the explicit `_invalidate_cache()` calls, but at least all “replace `_df` wholesale” paths are now guaranteed to invalidate caches.
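As an illustration of that split (schematic, with a made-up body rather than the real audb implementation), an in-place mutator keeps its own explicit invalidation call:

```python
from collections.abc import Sequence


def _update_media_version(self, files: Sequence[str], version: str) -> None:
    # Schematic body: rows are mutated in place, so _df is never
    # reassigned and _set_df() never runs
    self._df.loc[files, "version"] = version
    # Cached properties and the file index may now be stale
    self._invalidate_cache()
```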
This keeps your optimization (lazy dict index + property caching) but makes the abstraction more uniform and reduces the mental overhead around cache management.
</issue_to_address>
The benchmark table reports elapsed time in seconds, right? Since all values are <1, I assume the benchmark runs on a rather small database. It would be interesting to see how performance changes on a larger database.
The reported benchmark was for a database with 1,000,000 media files and is indeed in seconds. I repeated it with 100,000,000 media files.
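For readers who want to reproduce the effect without the full benchmark script, here is a minimal standalone sketch (synthetic file names and sizes, not `benchmarks/benchmark-dependencies-methods.py` itself) comparing membership tests on a pandas index against a plain dict:

```python
import time

import pandas as pd

n = 1_000_000  # number of synthetic media files
files = [f"file-{i}.wav" for i in range(n)]
df = pd.DataFrame(index=files)
file_index = {file: position for position, file in enumerate(files)}

t0 = time.perf_counter()
for _ in range(10_000):
    "file-10.wav" in df.index
t1 = time.perf_counter()
for _ in range(10_000):
    "file-10.wav" in file_index
t2 = time.perf_counter()

print(f"pandas index: {t1 - t0:.4f} s, dict: {t2 - t1:.4f} s")
```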
In the long run we still have two challenges:
Ok, so the only method that becomes slower is
Speed up parts of `audb.Dependencies` by using a dictionary for file lookup (`audb.Dependencies._file_index`), caching of properties (`audb.Dependencies._cache`), and using `numpy` for creating masks. In the benchmark results the caching is only visible in `Dependencies.tables`, as this was already called by `Dependencies.table_ids`.
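The `numpy` mask point can be illustrated in a few lines (synthetic data, not audb internals): comparing the column via `.to_numpy()` produces a plain boolean array and skips the overhead of constructing and aligning an intermediate boolean `Series`:

```python
import pandas as pd

df = pd.DataFrame(
    {"type": ["media", "meta", "media", "attachment"]},
    index=["a.wav", "db.csv", "b.wav", "doc.pdf"],
)

# pandas boolean indexing via a boolean Series
media_pandas = df.index[df["type"] == "media"].tolist()

# NumPy mask: same result, cheaper on large tables
mask = df["type"].to_numpy() == "media"
media_numpy = df.index[mask].tolist()

assert media_pandas == media_numpy == ["a.wav", "b.wav"]
```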
Summary by Sourcery
Optimize audb.Dependencies lookups and property access using a lazy file index and cached derived properties that are invalidated on data changes.
Enhancements:
Tests: