A high-performance semantic code search engine designed for the MediaWiki ecosystem. Built on the Jina 0.5b neural retrieval model, optimized for large-scale codebases like MediaWiki Core, Extensions, and WMF Operations. Metadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).
- Global MediaWiki Indexing: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
- Single-Stage Neural Retrieval: Uses `jina-code-embeddings-0.5b` with FAISS `IndexIVFPQ` for lightning-fast results (approx. 0.3s).
- Granular Structural Filtering: High-precision extraction and filtering of Functions, Types, Template Functions, and Template Types across 10 languages.
- Split-Build Architecture: Optimized for asymmetric hardware; run heavy extraction on a laptop and neural vectorization on a GPU.
- Massive Localization Footprint: Fully localized UI supporting 17 languages.
- Octahedron Vortex UI: A visually stunning frontend built with React and Three.js.
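To give a feel for why `IndexIVFPQ` is fast: an inverted-file (IVF) index partitions the vectors around coarse centroids, so a query only scans the few nearest partitions instead of every vector. The sketch below shows just that coarse-partitioning idea in plain Python on toy 2-D data; real IVF-PQ additionally compresses vectors with product quantization, and none of these names come from the project's actual code.

```python
# Minimal sketch of the inverted-file (IVF) idea behind FAISS's IndexIVFPQ.
# Only the coarse partitioning is shown; the PQ compression step is omitted.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid (one 'inverted list' each)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def search_ivf(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe closest inverted lists instead of all vectors."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# Toy 2-D "embeddings": two clusters around (0, 0) and (10, 10).
vecs = [(0.1, 0.0), (0.0, 0.2), (9.9, 10.1), (10.2, 9.8)]
cents = [(0.0, 0.0), (10.0, 10.0)]
inv = build_ivf(vecs, cents)
print(search_ivf((9.5, 10.0), vecs, cents, inv, nprobe=1, k=1))  # -> [2]
```

With `nprobe=1`, the query near (10, 10) never touches the two vectors in the other cluster, which is where the sub-linear search time comes from.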
The indexing pipeline is designed for a mass-scale, distributed build.
Create and activate a virtual environment (optional but recommended), then install dependencies:
```
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
pip install -r requirements.txt
```

To compile the React application, you will need Node.js installed on your machine. Install the dependencies and run the build:

```
npm install  # This also runs 'npm run build' automatically
```

This generates the pre-compiled frontend/js/app.js used by the application.
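A common way for `npm install` to trigger the build automatically is a `postinstall` script in package.json. The snippet below is only a guess at the shape of that hook; the project's actual package.json (and its build tool) may differ:

```json
{
  "scripts": {
    "build": "node build.js",
    "postinstall": "npm run build"
  }
}
```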
First, discover the ecosystem and mirror it for processing:
```
cd preprocessing
python list_repos.py      # Fetches 2,400+ repo URLs
python download_repos.py  # Shallow clones (approx. 8GB disk space)
```

Ensure all repositories are archived in Software Heritage for on-demand retrieval.
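`list_repos.py` presumably enumerates repositories from the Wikimedia Gerrit instance. One detail worth knowing if you adapt it: Gerrit's REST API prefixes every JSON response with the magic string `)]}'` to prevent XSSI, so the payload must be stripped before parsing. The sketch below shows that parsing step on a canned response; the function name is illustrative, not the script's actual API.

```python
# Hedged sketch: parsing a Gerrit /projects/ listing. Gerrit prepends the
# XSSI guard )]}' to JSON responses, which must be removed before json.loads.
import json

GERRIT_MAGIC = ")]}'"

def parse_gerrit_projects(raw: str) -> list[str]:
    """Strip the XSSI prefix and return the project names from /projects/."""
    if raw.startswith(GERRIT_MAGIC):
        raw = raw[len(GERRIT_MAGIC):]
    return sorted(json.loads(raw).keys())

# Canned response in the shape Gerrit returns (keys are project names):
sample = ")]}'\n" + '{"mediawiki/core": {"id": "x"}, "mediawiki/extensions/Echo": {"id": "y"}}'
print(parse_gerrit_projects(sample))  # -> ['mediawiki/core', 'mediawiki/extensions/Echo']
```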
Note
archive_to_swh.py requires a "bulk_save" token. For most users, it is recommended to use:
```
python archive_individual_to_swh.py
```

Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., Class::Method) and handles complex language features.
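The project's extraction is built on Tree-sitter across 10 languages; as a single-language illustration of what "qualified name" extraction means, here is a sketch using only Python's stdlib `ast` module. The function name and output format are hypothetical, not the project's actual code.

```python
# Illustration of qualified-name extraction (Class::method) for Python
# sources, using the stdlib ast module instead of Tree-sitter.
import ast

def qualified_functions(source: str) -> list[str]:
    """Return 'Class::method' (or bare function) names found in the source."""
    names = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                names.append("::".join(scope + [child.name]) if scope else child.name)
                visit(child, scope + [child.name])  # capture nested defs too
            elif isinstance(child, ast.ClassDef):
                visit(child, scope + [child.name])
            else:
                visit(child, scope)

    visit(ast.parse(source), [])
    return names

src = """
class Widget:
    def render(self): ...

def main(): ...
"""
print(qualified_functions(src))  # -> ['Widget::render', 'main']
```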
Phase 3a: Structural Extraction
```
python extract_structural_entities.py
```

Phase 3b: Identity Resolution

Resolve Git-compatible hashes to standard SHA-1. You can do this either locally (fast) or via the Software Heritage API (official):

- Option A: Local Resolution (Recommended)

  ```
  python resolve_swh_hashes_local.py
  ```

- Option B: API-based Resolution

  ```
  python resolve_swh_hashes.py
  ```
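Local resolution is possible because Git's blob hash is deterministic: it is the SHA-1 of the header `blob <size>\0` followed by the file contents, and Software Heritage's `sha1_git` content identifiers use the same construction. The sketch below computes it offline; the function name is illustrative of what `resolve_swh_hashes_local.py` presumably does, not its actual interface.

```python
# Compute a Git-compatible blob SHA-1 locally, with no network access.
# Git (and SWH's sha1_git) hash a blob as sha1(b"blob <size>\0" + data).
import hashlib

def git_blob_sha1(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known hash of the empty Git blob:
print(git_blob_sha1(b""))  # -> e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```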
Move raw_functions.json to a GPU-equipped environment to compute neural vectors and build the FAISS index.
```
cd backend
python generate_index.py  # Auto-detects CUDA/GPU
```

Before deploying, convert the production metadata to SQLite to stay within the 6GiB RAM limit:

```
cd backend
python migrate_to_sqlite.py
```

Once the index and database are ready, start the FastAPI backend from the root directory:

```
# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000
```

The server will be available at http://localhost:8000. You can access the automatic API documentation at http://localhost:8000/docs.
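For intuition on the SQLite migration step: moving metadata from an in-memory JSON structure into an indexed SQLite table lets the server look up records on demand instead of holding everything in RAM. The sketch below shows the general shape of such a migration with the stdlib `sqlite3` module; the schema and column names are assumptions, not the project's actual layout.

```python
# Hedged sketch of a JSON-to-SQLite metadata migration with an index on the
# lookup column. The real script writes functions.db; we use :memory: here.
import json
import sqlite3

records = json.loads('[{"id": 0, "name": "Parser::parse", "repo": "mediawiki/core"}]')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (id INTEGER PRIMARY KEY, name TEXT, repo TEXT)")
conn.execute("CREATE INDEX idx_functions_name ON functions(name)")
conn.executemany(
    "INSERT INTO functions (id, name, repo) VALUES (:id, :name, :repo)", records
)

row = conn.execute(
    "SELECT repo FROM functions WHERE name = ?", ("Parser::parse",)
).fetchone()
print(row[0])  # -> mediawiki/core
```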
Follow these steps to deploy the application on Wikimedia Toolforge.
Note
The examples below use supnabla as the username and code2codesearch as the project name. Replace these with your own Toolforge credentials where applicable.
Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:
# From the project root
scp -r "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -r "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -r "./backend/functions.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/Log into Toolforge and set the necessary permissions:
```
ssh supnabla@login.toolforge.org
chmod -R a+r /data/project/code2codesearch/mediawiki-code2code-search/models/
chmod a+r /data/project/code2codesearch/backend/functions.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
```

Now you are ready to deploy the webservice:
```
# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f
```

- Neural Model: Jina Code Embeddings (0.5b)
- Vector Engine: FAISS (IndexIVFPQ for memory efficiency)
- Segmentation: Tree-sitter
- Archive Access: Software Heritage
- Frontend: React 18 / Three.js
Apache 2.0 License. Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.