refactor: replace file-based HTTP cache with SQLite backend #5501

Open

baszalmstra wants to merge 8 commits into prefix-dev:main from baszalmstra:claude/optimize-pypi-caching-2uVVC

Conversation

@baszalmstra
Contributor

Description

Replaces the default file-based CACacheManager with a new SqliteCacheManager that stores all cached HTTP responses in a single SQLite database file instead of many small files on disk.

Motivation:

  • File-based caching creates many small files, which is slow on HPC and network filesystems as well as on Windows
  • A single SQLite database file is more efficient for these environments
  • Reduces filesystem overhead and improves concurrent access patterns

Implementation Details:

  • New SqliteCacheManager implements the CacheManager trait from http_cache_reqwest
  • Database uses WAL journal mode for good concurrent read performance
  • Sets synchronous = NORMAL since this is a cache and data loss on crash is acceptable
  • Response body stored as raw BLOB (no serialization overhead)
  • Response metadata (headers, status, url, version) and cache policy stored as JSON columns
  • Includes 5-second busy timeout for concurrent process coordination
  • The parent directory is created automatically if it doesn't exist (see the sketch below)
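
As a rough illustration of the points above, the connection setup could look like the following sketch. The helper name and the use of rusqlite are assumptions; only the PRAGMA values and the directory-creation behaviour come from this PR:

```rust
use std::path::Path;

use rusqlite::Connection;

/// Hypothetical helper: open (or create) the cache database with the
/// settings listed above.
fn open_cache_db(path: &Path) -> Result<Connection, Box<dyn std::error::Error>> {
    // Create the parent directory if it does not exist yet.
    if let Some(parent) = path.parent() {
        fs_err::create_dir_all(parent)?;
    }

    let conn = Connection::open(path)?;
    conn.execute_batch(
        "-- Allow concurrent readers while a single writer is active.
         PRAGMA journal_mode = WAL;
         -- A cache can tolerate losing entries on a crash.
         PRAGMA synchronous = NORMAL;
         -- Wait up to 5 seconds if another process holds the write lock.
         PRAGMA busy_timeout = 5000;",
    )?;
    Ok(conn)
}
```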

Fixes #5439

How Has This Been Tested?

The change integrates with existing HTTP caching infrastructure. The CacheManager trait implementation ensures compatibility with the http_cache_reqwest library's cache layer. Existing code paths that use HTTP caching will automatically use the new SQLite backend without modification.
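
The trait in question is defined in the underlying http-cache crate (which http-cache-reqwest builds on). A skeleton of what the SQLite manager has to provide is sketched below; the signatures are paraphrased and may differ slightly between http-cache versions, and the struct internals are an assumption:

```rust
use async_trait::async_trait;
use http_cache::{CacheManager, HttpResponse, Result};
use http_cache_semantics::CachePolicy;

/// Sketch only: the real type presumably wraps a SQLite connection
/// (e.g. behind a mutex); that detail is not taken from the diff.
pub struct SqliteCacheManager {
    // connection handle goes here
}

#[async_trait]
impl CacheManager for SqliteCacheManager {
    async fn get(&self, cache_key: &str) -> Result<Option<(HttpResponse, CachePolicy)>> {
        // SELECT body, response_meta, policy for this key and rebuild the
        // HttpResponse from the BLOB + JSON columns.
        todo!()
    }

    async fn put(
        &self,
        cache_key: String,
        response: HttpResponse,
        policy: CachePolicy,
    ) -> Result<HttpResponse> {
        // INSERT OR REPLACE the row, then hand the response back unchanged.
        todo!()
    }

    async fn delete(&self, cache_key: &str) -> Result<()> {
        // DELETE the row for this key; missing keys are not an error.
        todo!()
    }
}
```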

Further testing should be done manually and in CI.

AI Disclosure

Written by Claude Code Opus 4.6 Extended.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

The PyPI mapping system was using cacache (CACacheManager) which creates
many small files on disk. This works poorly on HPC and network
filesystems where metadata operations on many small files are expensive.

Replace CACacheManager with a new SqliteCacheManager that stores all
HTTP cache entries in a single SQLite database file. The implementation:

- Uses WAL journal mode for good concurrent read performance
- Sets synchronous=NORMAL since this is a cache (crash data loss is OK)
- Configures a 5s busy_timeout for concurrent process access
- Serializes HttpResponse + CachePolicy together as JSON blobs
- Fully respects HTTP cache semantics (same CacheManager trait)

The SQLite database is stored at:
  ~/.cache/pixi/conda-pypi-mapping/http_cache.sqlite

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
bincode serializes the response body as raw bytes, avoiding the base64
overhead that serde_json would introduce for the Vec<u8> body field.
This also matches what the original CACacheManager used.

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
… columns

Instead of serializing the entire HttpResponse+CachePolicy as a single
blob, split the schema into three columns:
- body: raw BLOB (no serialization overhead for response bytes)
- response_meta: JSON (headers, status, url, version)
- policy: JSON (HTTP cache policy)

This avoids any encoding overhead for the response body and keeps the
metadata human-readable for debugging (see the schema sketch below).

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
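
A sketch of what that three-column layout could look like with rusqlite; the table and column names are illustrative rather than taken from the diff:

```rust
use rusqlite::{params, Connection};

/// Create the cache table described above (names are assumptions).
fn init_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS http_cache (
             key           TEXT PRIMARY KEY,  -- cache key computed by http-cache
             body          BLOB NOT NULL,     -- raw response bytes, no extra encoding
             response_meta TEXT NOT NULL,     -- JSON: headers, status, url, version
             policy        TEXT NOT NULL      -- JSON: the serialized CachePolicy
         );",
    )
}

/// Store or overwrite a single cache entry.
fn store(
    conn: &Connection,
    key: &str,
    body: &[u8],
    meta_json: &str,
    policy_json: &str,
) -> rusqlite::Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO http_cache (key, body, response_meta, policy)
         VALUES (?1, ?2, ?3, ?4)",
        params![key, body, meta_json, policy_json],
    )?;
    Ok(())
}
```
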
Move the SQLite-backed CacheManager out of pypi_mapping into a
standalone crate at crates/http_cache_sqlite. This implementation is
not pixi-specific and can be reused by any consumer of http-cache-reqwest
that wants a single-file SQLite cache instead of many small files.

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
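
For a consumer of the new crate, wiring it into a reqwest client would look roughly like the usual http-cache-reqwest setup. `SqliteCacheManager::new` is a guessed constructor name, and the exact `HttpCache`/`HttpCacheOptions` shape depends on the http-cache-reqwest version:

```rust
use http_cache_reqwest::{Cache, CacheMode, HttpCache, HttpCacheOptions};
use reqwest_middleware::{ClientBuilder, ClientWithMiddleware};

/// Build a reqwest client whose HTTP cache lives in a single SQLite file.
fn cached_client(db_path: &std::path::Path) -> ClientWithMiddleware {
    // Hypothetical constructor; the real entry point may differ.
    let manager = http_cache_sqlite::SqliteCacheManager::new(db_path)
        .expect("failed to open the SQLite HTTP cache");

    ClientBuilder::new(reqwest::Client::new())
        .with(Cache(HttpCache {
            mode: CacheMode::Default,
            manager,
            options: HttpCacheOptions::default(),
        }))
        .build()
}
```
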
Tests cover the following cases (the missing-key cases are sketched below):
- get on missing key returns None
- put then get roundtrips body, status, and headers
- put overwrites existing entries
- delete removes entries
- delete on nonexistent key is ok
- multiple keys are independent
- response headers are preserved
- binary body (all 256 byte values including null)
- empty body
- data persists across reopen of the database
- parent directories are created automatically

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
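
A minimal sketch of the missing-key cases from the list above, again using the guessed `SqliteCacheManager::new` constructor and the trait imported from the underlying http-cache crate:

```rust
use http_cache::CacheManager;

#[tokio::test]
async fn missing_key_is_none_and_delete_is_ok() {
    let dir = tempfile::tempdir().unwrap();
    let manager = http_cache_sqlite::SqliteCacheManager::new(dir.path().join("cache.sqlite"))
        .expect("failed to open cache database");

    // `get` on a key that was never stored yields None rather than an error.
    assert!(manager.get("no-such-key").await.unwrap().is_none());

    // `delete` on a nonexistent key is a no-op, not an error.
    manager.delete("no-such-key").await.unwrap();
}
```
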
The workspace clippy config disallows std::fs methods. Switch
create_dir_all to fs_err::create_dir_all for better error messages.

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
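
The swap is a drop-in change because fs_err mirrors the std::fs API while attaching the offending path to the error, for example:

```rust
use std::path::Path;

/// Same signature as std::fs::create_dir_all, but the error message
/// includes the path that could not be created.
fn ensure_cache_dir(dir: &Path) -> std::io::Result<()> {
    fs_err::create_dir_all(dir)
}
```
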
Store the SQLite database directly as ~/.cache/pixi/conda-pypi-mapping.sqlite
instead of nesting it inside a subdirectory. Simpler and avoids creating
an extra directory just for one file.

https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1

Development

Successfully merging this pull request may close these issues.

"File still doesn't exist" error during conda-pypi mapping fetch
