This document outlines the architectural constraints and development best practices for the MediaWiki Code2Code Search project.
The application is deployed on Wikimedia Toolforge, which has a strict 6 GiB RAM limit for webservices.
- SQLite for Metadata: Avoid loading large JSON/list structures into memory. Use the indexed SQLite database (`backend/functions.db`) for all metadata lookups.
- FAISS Indexing: Use `IndexIVFPQ` or other compressed FAISS indexes to keep the memory footprint low.
- Lazy Loading: Ensure models and indexes are loaded once during the server lifespan, not on every request.
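The SQLite guideline can be illustrated with a minimal sketch. The table name `functions` and its columns are assumptions made for this example, and the real layout of `backend/functions.db` may differ. The point is that only the rows matching the vector ids returned by a search are read, in ranking order, instead of holding a full metadata list in RAM.

```python
import sqlite3
from typing import Iterable


def open_metadata_db(path: str) -> sqlite3.Connection:
    """Open the metadata database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    # INTEGER PRIMARY KEY is backed by SQLite's rowid, so lookups by id are indexed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS functions ("
        "vector_id INTEGER PRIMARY KEY, name TEXT, file_path TEXT)"
    )
    return conn


def lookup_metadata(conn: sqlite3.Connection, vector_ids: Iterable[int]) -> list:
    """Fetch metadata for the given ids only, preserving the search ranking."""
    ids = list(vector_ids)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT vector_id, name, file_path FROM functions "
        f"WHERE vector_id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row["vector_id"]: dict(row) for row in rows}
    # Unknown ids are skipped rather than raising.
    return [by_id[i] for i in ids if i in by_id]
```

Opening the connection once at startup and reusing it for each request also fits the load-once guideline above.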
The production environment (Toolforge) is CPU-only.
- Recall vs. Rerank: Highly accurate but heavy models (like Rerankers) should be used sparingly or optimized. The current architecture prioritizes the recall model (`jina-code-embeddings-0.5b`) which is fast on CPUs.
- Quantization: If memory or speed becomes an issue, consider dynamic quantization for models.
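One way to use a reranker sparingly is to run it only on the first few recall candidates. The following is a rough sketch and not the project's actual pipeline: the `search` function, its parameters, and the plain-Python cosine similarity are all assumptions standing in for the real embedding model and FAISS index.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: Sequence[float],
    corpus: Sequence[tuple],
    recall_k: int = 50,
    rerank_k: int = 5,
    reranker: Optional[Callable[[str], float]] = None,
) -> list:
    """Cheap recall over the whole corpus, expensive rerank on the head only.

    corpus is a sequence of (doc_id, vector) pairs; reranker maps a doc_id
    to a relevance score and is called at most rerank_k times.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    candidates = [doc_id for doc_id, _ in ranked[:recall_k]]
    if reranker is None:
        return candidates
    head = sorted(candidates[:rerank_k], key=reranker, reverse=True)
    return head + candidates[rerank_k:]
```

Keeping `rerank_k` small bounds the number of heavy model calls per request regardless of corpus size.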
The project follows a "Build Heavy, Serve Light" philosophy:
- Indexing (GPU): Extraction and neural vectorization should be performed on a GPU-equipped machine to generate the `mediawiki.index`.
- Serving (CPU): The `app.py` server is optimized for high-speed retrieval on standard CPU hardware.
- Use `httpx.AsyncClient` for external API calls (e.g., Software Heritage) to avoid blocking the event loop.
- Endpoints should be `async def` when performing I/O.
- Paths should be relative to the application root where possible, or use the `BASE_DIR` pattern defined in `app.py`.
- The application includes specific patches (e.g., environment variables for `torch`, user identification) to run smoothly in Toolforge's Kubernetes environment.
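A common form of the `BASE_DIR` pattern anchors every path to the directory containing the entry module, so the process's working directory does not matter. The sketch below assumes that form. The exact definition in `app.py` is not shown here, and the helper names are hypothetical.

```python
from pathlib import Path


def app_root(module_file: str) -> Path:
    """Return the directory that contains the given module file."""
    return Path(module_file).resolve().parent


def data_path(base_dir: Path, *parts: str) -> Path:
    """Build a path under the application root instead of the current directory."""
    return base_dir.joinpath(*parts)


# In an entry module this would typically read:
#   BASE_DIR = app_root(__file__)
#   DB_PATH = data_path(BASE_DIR, "backend", "functions.db")
```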
- The frontend supports multiple languages. When adding features, ensure strings are externalized into the i18n JSON files.
- The `update_i18n.py` script can be used to synchronize translation keys.
- Mocking: When writing tests for the API, always mock the `SentenceTransformer` and other heavy models to ensure CI runs quickly without downloading weights.
- Safe Testing: Never run tests that write to production folders (`backend/swh_cache`, etc.). Use temporary directories for test artifacts.
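Both testing rules can be sketched with the standard library alone. `embed_query` and `write_cache_entry` are hypothetical stand-ins for application code, not functions taken from the project. The mock replaces the model object, so no weights are downloaded, and the cache write goes to a throwaway directory.

```python
import tempfile
from pathlib import Path
from unittest.mock import MagicMock


def embed_query(model, text: str) -> list:
    """Hypothetical app helper: embeds one query via the model's encode()."""
    return model.encode([text])[0]


def write_cache_entry(cache_dir: Path, key: str, payload: str) -> Path:
    """Hypothetical app helper: writes one cache file under cache_dir."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{key}.json"
    target.write_text(payload, encoding="utf-8")
    return target


def test_embed_query_uses_mocked_model():
    fake_model = MagicMock()  # stands in for a SentenceTransformer instance
    fake_model.encode.return_value = [[0.1, 0.2, 0.3]]
    assert embed_query(fake_model, "function foo()") == [0.1, 0.2, 0.3]
    fake_model.encode.assert_called_once_with(["function foo()"])


def test_cache_writes_stay_in_temp_dir():
    # The directory and its contents are removed when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_cache_entry(Path(tmp) / "swh_cache", "abc", "{}")
        assert path.read_text(encoding="utf-8") == "{}"
        assert Path(tmp) in path.parents
```

With pytest, the built-in `tmp_path` fixture offers the same isolation without the explicit context manager.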