Add file lookup and cache to Dependencies #534
Conversation
Reviewer's Guide

Optimizes audb.Dependencies by introducing a lazy dictionary-based file index and a centralized property cache that are invalidated on all dataframe mutations, and by using NumPy masks instead of pandas boolean indexing for several filtered views, plus a small adjustment to the benchmark pre-warm step.

Sequence diagram for Dependencies caching, index building, and invalidation:

```mermaid
sequenceDiagram
participant Client
participant Dependencies
%% File membership check using lazy index
Client->>Dependencies: __contains__(file)
alt _file_index is None
Dependencies->>Dependencies: _get_file_index()
Dependencies->>Dependencies: build dict from _df.index
end
Dependencies-->>Client: bool
%% Cached property access (example: media)
Client->>Dependencies: media()
alt cache hit
Dependencies->>Dependencies: read _cache[media]
Dependencies-->>Client: media list
else cache miss
Dependencies->>Dependencies: compute NumPy mask from _df[type]
Dependencies->>Dependencies: build media list from _df.index[mask]
Dependencies->>Dependencies: store in _cache[media]
Dependencies-->>Client: media list
end
%% Dataframe mutation and cache invalidation (example: _add_media)
Client->>Dependencies: _add_media(files, archive, format, checksum, bit_depth, channels, sampling_rate, type, version, removed)
Dependencies->>Dependencies: append to _df
Dependencies->>Dependencies: _invalidate_cache()
Dependencies->>Dependencies: clear _cache and reset _file_index to None
```

Class diagram for updated audb.Dependencies caching and indexing:

```mermaid
classDiagram
class Dependencies {
- pd.DataFrame _df
- dict[str,int] _file_index
- dict[str,object] _cache
- pa.Schema _schema
+ __call__() pd.DataFrame
+ __contains__(file: str) bool
+ __eq__(other: Dependencies) bool
+ __getitem__(file: str) list
+ __len__() int
+ __str__() str
+ archives() list[str]
+ attachments() list[str]
+ attachment_ids() list[str]
+ files() list[str]
+ media() list[str]
+ removed_media() list[str]
+ table_ids() list[str]
+ tables() list[str]
+ archive(file: str) str
+ removed(file: str) bool
+ load(path: str) void
- _get_file_index() dict[str,int]
- _invalidate_cache() void
- _add_attachment(file: str, archive: str, format: str, checksum: str, version: str) void
- _add_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
- _add_meta(file: str, archive: str, checksum: str, version: str) void
- _drop(files: Sequence[str]) void
- _remove(file: str) void
- _update_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
- _update_media_version(files: Sequence[str], version: str) void
}
```
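A condensed sketch of the pattern the diagrams describe: a lazily built dict index for membership tests and a property cache cleared on every mutation. This is an illustration only, not the actual audb source; class and attribute names follow the diagrams above, but method bodies are simplified.

```python
import pandas as pd


class Dependencies:
    def __init__(self) -> None:
        self._df = pd.DataFrame(columns=["type"])
        self._file_index: dict[str, int] | None = None  # built on first use
        self._cache: dict[str, object] = {}  # derived values, cleared on mutation

    def _get_file_index(self) -> dict[str, int]:
        # Build file -> row position lazily, only when first needed
        if self._file_index is None:
            self._file_index = {
                file: position for position, file in enumerate(self._df.index)
            }
        return self._file_index

    def __contains__(self, file: str) -> bool:
        # O(1) dict lookup instead of a DataFrame index lookup
        return file in self._get_file_index()

    @property
    def media(self) -> list[str]:
        # Cache hit returns the stored list; a miss computes a NumPy mask
        if "media" not in self._cache:
            mask = self._df["type"].to_numpy() == "media"
            self._cache["media"] = self._df.index[mask].tolist()
        return self._cache["media"]

    def _invalidate_cache(self) -> None:
        # Must run after every mutation of self._df
        self._cache.clear()
        self._file_index = None
```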
Force-pushed from 555858d to f36e28e
Hey - I've found 2 issues, and left some high level feedback:
- The new dictionary-based file index assumes unique index labels for files; if non-unique indexes are possible, consider enforcing or asserting uniqueness (or falling back to the original `.loc` behavior) to avoid subtle behavior changes in `__getitem__` compared to pandas (one possible guard is sketched after this list).
- In the benchmark, replacing the warm-up call from `deps.archives` to `deps._df` is slightly misleading and relies on a private attribute; consider either explicitly warming the relevant cached properties or updating the comment to reflect that you're intentionally avoiding cache warm-up.
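One possible shape for the uniqueness guard mentioned in the first point (a hypothetical sketch, not code from this PR): assert uniqueness once when the index is built, so duplicate labels fail loudly instead of silently diverging from pandas `.loc` semantics.

```python
def _get_file_index(self) -> dict[str, int]:
    # Hypothetical guard: a dict can hold only one position per label,
    # so refuse to build the index from a non-unique DataFrame index
    if self._file_index is None:
        if not self._df.index.is_unique:
            raise ValueError(
                "Cannot build file index: dependency table has duplicate files"
            )
        self._file_index = {
            file: position for position, file in enumerate(self._df.index)
        }
    return self._file_index
```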
## Individual Comments
### Comment 1
<location> `benchmarks/benchmark-dependencies-methods.py:97-101` </location>
<code_context>
# Further calls will be faster
"file-10.wav" in deps
-deps.archives
+deps._df
_ = deps.archive(_files[0])
</code_context>
<issue_to_address>
**suggestion:** The benchmark pre-warm step now bypasses the property logic and no longer exercises the cache.
Accessing `deps._df` here only touches the underlying DataFrame and skips the new caching/indexing logic. If the intent is to benchmark steady-state performance of the public API with caches warmed, consider invoking a real property like `deps.archives` (or a dedicated warm-up helper) instead of the private `_df` attribute.
```suggestion
# Pre-warm accesses
# Further calls will be faster
"file-10.wav" in deps
_ = deps.archives
_ = deps.archive(_files[0])
```
</issue_to_address>
### Comment 2
<location> `audb/core/dependencies.py:67` </location>
<code_context>
self._df = self._set_dtypes(self._df)
+ # Lazy dictionary index for O(1) file lookups
+ self._file_index: dict[str, int] | None = None
+ # Property cache (invalidated on modifications)
+ self._cache: dict[str, object] = {}
# pyarrow schema
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared caching and DataFrame-assignment logic into small helper methods to keep the new performance optimizations while reducing repetition and cache-management complexity.
You can keep the performance improvements while reducing the new complexity by:
### 1. Deduplicate the cache pattern
All properties now repeat the same “check key, compute, store, return” logic. A small helper removes boilerplate and makes the contract explicit:
```python
from collections.abc import Callable  # import needed for the type hint


def _get_cached(self, key: str, build: Callable[[], object]) -> object:
    try:
        return self._cache[key]
    except KeyError:
        value = build()
        self._cache[key] = value
        return value
```
Then properties become much simpler and harder to get wrong:
```python
@property
def archives(self) -> list[str]:
    return self._get_cached(
        "archives",
        lambda: sorted(self._df["archive"].unique().tolist()),
    )

@property
def attachments(self) -> list[str]:
    return self._get_cached(
        "attachments",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["attachment"]
        ].tolist(),
    )

@property
def media(self) -> list[str]:
    return self._get_cached(
        "media",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["media"]
        ].tolist(),
    )
```
This keeps your caching behavior but avoids repeated manual dict access and string-key bugs.
### 2. Centralize cache invalidation on `_df` mutation
Right now every mutator manually calls `_invalidate_cache()`. You can centralize this by routing `_df` assignments through a helper, reducing the chance of forgetting invalidation in future changes:
```python
def _set_df(self, df: pd.DataFrame) -> None:
    self._df = df
    self._invalidate_cache()
```
Then update mutation sites to use `_set_df` instead of bare `_df` assignment:
```python
def _add_media(self, values: Sequence[tuple[...]]):
    df = pd.DataFrame.from_records(
        values,
        columns=["file"] + list(define.DEPENDENCY_TABLE.keys()),
    ).set_index("file")
    df = self._set_dtypes(df)
    self._set_df(pd.concat([self._df, df]))

def _drop(self, files: Sequence[str]):
    self._set_df(self._df[~self._df.index.isin(files)])

def load(...):
    ...
    if extension == "csv":
        ...
        self._set_df(self._table_to_dataframe(table))
    elif extension == "parquet":
        ...
        self._set_df(self._table_to_dataframe(table))
```
For in-place updates that don’t replace `_df` (like `_remove`, `_update_media`, `_update_media_version`), you can keep the explicit `_invalidate_cache()` calls, but at least all “replace `_df` wholesale” paths are now guaranteed to invalidate caches.
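As an illustration of that split (schematic, with a made-up body rather than the real audb implementation), an in-place mutator keeps its own explicit invalidation call:

```python
from collections.abc import Sequence


def _update_media_version(self, files: Sequence[str], version: str) -> None:
    # Schematic body: rows are mutated in place, so _df is never
    # reassigned and _set_df() never runs
    self._df.loc[files, "version"] = version
    # Cached properties and the file index may now be stale
    self._invalidate_cache()
```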
This keeps your optimization (lazy dict index + property caching) but makes the abstraction more uniform and reduces the mental overhead around cache management.
</issue_to_address>
The benchmark table reports elapsed time in seconds, right? Since all values are <1, I assume the benchmark runs on a rather small database. It would be interesting to see how performance changes on a larger database.
The reported benchmark was for a database with 1,000,000 media files and is indeed in seconds. I repeated it with 100,000,000 media files.
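For readers who want to reproduce the effect without the full benchmark script, here is a minimal standalone sketch (synthetic file names and sizes, not `benchmarks/benchmark-dependencies-methods.py` itself) comparing membership tests on a pandas index against a plain dict:

```python
import time

import pandas as pd

n = 1_000_000  # number of synthetic media files
files = [f"file-{i}.wav" for i in range(n)]
df = pd.DataFrame(index=files)
file_index = {file: position for position, file in enumerate(files)}

t0 = time.perf_counter()
for _ in range(10_000):
    "file-10.wav" in df.index
t1 = time.perf_counter()
for _ in range(10_000):
    "file-10.wav" in file_index
t2 = time.perf_counter()

print(f"pandas index: {t1 - t0:.4f} s, dict: {t2 - t1:.4f} s")
```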
In the long run we still have two challenges:
Ok, so the only method that becomes slower is
Speed up parts of `audb.Dependencies` by using a dictionary for file lookup (`audb.Dependencies._file_index`), caching of properties (`audb.Dependencies._cache`), and using `numpy` for creating masks. In the benchmark results the caching is only visible in `Dependencies.tables`, as this was already called by `Dependencies.table_ids`.
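The `numpy` mask point can be illustrated in a few lines (synthetic data, not audb internals): comparing the column via `.to_numpy()` produces a plain boolean array and skips the overhead of constructing and aligning an intermediate boolean `Series`:

```python
import pandas as pd

df = pd.DataFrame(
    {"type": ["media", "meta", "media", "attachment"]},
    index=["a.wav", "db.csv", "b.wav", "doc.pdf"],
)

# pandas boolean indexing via a boolean Series
media_pandas = df.index[df["type"] == "media"].tolist()

# NumPy mask: same result, cheaper on large tables
mask = df["type"].to_numpy() == "media"
media_numpy = df.index[mask].tolist()

assert media_pandas == media_numpy == ["a.wav", "b.wav"]
```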
Summary by Sourcery
Optimize audb.Dependencies lookups and property access using a lazy file index and cached derived properties that are invalidated on data changes.
Enhancements:
Tests: