MediaWiki Code2Code Search


A high-performance semantic code search engine designed for the MediaWiki ecosystem. Built on the Jina 0.5b neural retrieval model, optimized for large-scale codebases like MediaWiki Core, Extensions, and WMF Operations. Metadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).

✨ Key Features

  • 📂 Global MediaWiki Indexing: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
  • 🧠 Single-Stage Neural Retrieval: Uses jina-code-embeddings-0.5b with FAISS IndexIVFPQ for lightning-fast results (approx. 0.3s).
  • 🌳 Granular Structural Filtering: High-precision extraction and filtering of Functions, Types, Template Functions, and Template Types across 10 languages.
  • 🏗️ Split-Build Architecture: Optimized for asymmetric hardware, running heavy extraction on a laptop and neural vectorization on a GPU.
  • 🌍 Massive Localization Footprint: Fully localized UI supporting 17 languages.
  • 🌌 Octahedron Vortex UI: A visually stunning frontend built with React and Three.js.
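
The single-stage retrieval above can be pictured with a toy example: embed the corpus, normalize, and rank by cosine similarity. This is a conceptual sketch only; the real pipeline uses jina-code-embeddings-0.5b vectors stored in a FAISS IndexIVFPQ rather than the random vectors and brute-force scan below.

```python
import numpy as np

# Toy stand-in for the embedding index: rows are code-snippet vectors.
# In the real pipeline these come from the neural model and live in FAISS;
# plain cosine similarity is used here only to illustrate the idea.
rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize rows

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q               # cosine similarity for unit vectors
    return np.argsort(-scores)[:k]    # highest-scoring first

top = search(corpus[0])
assert top[0] == 0  # a vector is most similar to itself
```

An IVFPQ index replaces this exhaustive scan with a coarse cell lookup plus compressed-vector distances, which is what keeps queries around 0.3 s over millions of entries.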

🚀 Scaling & Pipeline

The indexing pipeline is designed for a mass-scale, distributed build.

πŸ› οΈ Setup

Backend (Python)

Create and activate a virtual environment (optional but recommended), then install dependencies:

python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

pip install -r requirements.txt

Frontend (Node.js)

To compile the React application, you will need Node.js installed on your machine. Install the dependencies and run the build:

npm install # This also runs 'npm run build' automatically

This generates the pre-compiled frontend/js/app.js used by the application.

Phase 1: Discovery & Mirroring (Local)

First, discover the ecosystem and mirror it for processing:

cd preprocessing
python list_repos.py      # Fetches 2,400+ repo URLs
python download_repos.py  # Shallow clones (approx. 8GB disk space)
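
list_repos.py presumably enumerates repositories from the Wikimedia Gerrit projects API (an assumption; the script's internals are not shown here). One wrinkle worth knowing if you adapt this step: Gerrit REST responses are prefixed with an XSSI guard line `)]}'` that must be stripped before JSON parsing:

```python
import json

# Assumed discovery endpoint: Gerrit's project listing filtered by prefix.
GERRIT_PROJECTS_URL = "https://gerrit.wikimedia.org/r/projects/?p=mediawiki%2F"

def parse_gerrit_projects(raw: str) -> list[str]:
    """Strip Gerrit's ")]}'" XSSI guard, then parse the JSON object whose
    keys are project names."""
    body = raw.removeprefix(")]}'").lstrip()
    return sorted(json.loads(body))

# Offline demonstration with a literal response body:
sample = ")]}'\n" + json.dumps(
    {"mediawiki/core": {}, "mediawiki/extensions/Scribunto": {}})
print(parse_gerrit_projects(sample))
```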

Phase 2: Archiving (Global)

Ensure all repositories are archived in Software Heritage for on-demand retrieval.

Note

archive_to_swh.py requires a "bulk_save" token. For most users, it is recommended to use:

python archive_individual_to_swh.py
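
For reference, the public Software Heritage "Save Code Now" API queues an origin for archiving via a POST to a per-origin URL. The helper below only constructs that URL; the script's actual request logic, authentication, and rate-limit handling are not shown here and may differ.

```python
# Base of the Software Heritage "Save Code Now" API (public documentation).
SWH_API = "https://archive.softwareheritage.org/api/1/origin/save"

def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the per-origin save endpoint. A POST to this URL queues the
    origin for archiving; the origin URL is embedded verbatim in the path."""
    return f"{SWH_API}/{visit_type}/url/{origin_url}/"

print(save_request_url("https://gerrit.wikimedia.org/r/mediawiki/core"))
```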

Phase 3: Extraction (Local/CPU)

Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., Class::Method) and handles complex language features.

Phase 3a: Structural Extraction

python extract_structural_entities.py
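
The extractor's actual parsers and output format are not shown here; as an illustration of structural extraction with qualified names, here is a minimal Python-only sketch built on the standard-library ast module (the real pipeline covers 10 languages):

```python
import ast

def extract_entities(source: str) -> list[dict]:
    """Walk a Python module and emit functions/classes with qualified names
    (e.g. MyClass::method). Illustrative only: the field names and the
    Class::Method separator mirror the README's example, not a known schema."""
    out = []
    tree = ast.parse(source)

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                out.append({"kind": "function", "name": prefix + child.name,
                            "line": child.lineno})
                visit(child, prefix + child.name + "::")
            elif isinstance(child, ast.ClassDef):
                out.append({"kind": "type", "name": prefix + child.name,
                            "line": child.lineno})
                visit(child, prefix + child.name + "::")
            else:
                visit(child, prefix)  # descend through other nodes unchanged

    visit(tree)
    return out

print(extract_entities("class A:\n    def f(self): pass\n"))
```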

Phase 3b: Identity Resolution

Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):

  • Option A: Local Resolution (Recommended)
    python resolve_swh_hashes_local.py
  • Option B: API-based Resolution
    python resolve_swh_hashes.py
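
Local resolution is possible because Git object identifiers are plain SHA-1 digests over a typed header plus the raw content, a scheme Software Heritage reuses for content objects. For a blob:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the Git blob identifier locally: SHA-1 over the header
    "blob <size>\\0" followed by the raw bytes, no network round-trip."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_sha1(b"hello\n"))  # same result as `git hash-object` on these bytes
```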

Phase 4: Indexing (Remote/GPU)

Move raw_functions.json to a GPU-equipped environment to compute neural vectors and build the FAISS index.

cd backend
python generate_index.py  # Auto-detects CUDA/GPU
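
generate_index.py's internals are not shown here; the sketch below illustrates only the batching pattern such a step typically uses, with a stand-in hash "embedding" in place of the real model so it stays self-contained:

```python
import numpy as np

def embed_batch(snippets):
    """Placeholder for the neural model: the real pipeline would call
    jina-code-embeddings-0.5b here; hashing keeps this sketch runnable."""
    return np.array([[float(hash(s) % 97)] for s in snippets],
                    dtype=np.float32)

def build_vectors(functions, batch_size=2):
    """Embed code in fixed-size batches (bounding GPU memory) and stack the
    results into one matrix whose row order matches the metadata order."""
    parts = [embed_batch(functions[i:i + batch_size])
             for i in range(0, len(functions), batch_size)]
    return np.vstack(parts)

vecs = build_vectors(["def a(): pass", "def b(): pass", "def c(): pass"])
```

Keeping row order identical to the metadata order is what later lets a FAISS hit id map straight back to its function record.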

Phase 5: Memory Optimization & Deployment (Local/Toolforge)

Before deploying, convert the production metadata to SQLite so the service stays within Toolforge's 6 GiB RAM limit:

cd backend
python migrate_to_sqlite.py
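
A minimal sketch of such a migration using the standard-library sqlite3 module; the table and column names are illustrative, not the project's actual schema. Keeping the SQLite row id aligned with the FAISS vector id lets the service look up metadata per hit without holding the whole JSON in memory:

```python
import sqlite3

def migrate(functions, db_path=":memory:"):
    """Write one indexed row per function; the integer primary key is assumed
    to match the vector's position in the FAISS index."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS functions (
        id INTEGER PRIMARY KEY,   -- position matches the FAISS vector id
        name TEXT, repo TEXT, path TEXT, start_line INTEGER)""")
    con.executemany(
        "INSERT INTO functions (id, name, repo, path, start_line) "
        "VALUES (?, ?, ?, ?, ?)",
        [(i, f["name"], f["repo"], f["path"], f["start_line"])
         for i, f in enumerate(functions)])
    con.execute("CREATE INDEX IF NOT EXISTS idx_name ON functions(name)")
    con.commit()
    return con

con = migrate([{"name": "Parser::parse", "repo": "mediawiki/core",
                "path": "includes/parser/Parser.php", "start_line": 100}])
print(con.execute("SELECT name FROM functions WHERE id = 0").fetchone())
```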

Once the index and database are ready, start the FastAPI backend from the root directory:

# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000

The server will be available at http://localhost:8000. You can access the automatic API documentation at http://localhost:8000/docs.


🚀 Deployment (Toolforge)

Follow these steps to deploy the application on Wikimedia Toolforge.

Note

The examples below use supnabla as the username and code2codesearch as the project name. Replace these with your own Toolforge credentials where applicable.

1. Upload Assets

Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:

# From the project root
scp -r "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -r "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -r "./backend/functions.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/

2. Configure Permissions

Log into Toolforge and set the necessary permissions:

ssh supnabla@login.toolforge.org

chmod -R a+r /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/functions.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index

3. Deploy

Now you are ready to deploy the webservice:

# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f

πŸ› οΈ Technology Stack

📄 Licence

Apache 2.0 License. Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.
