A high-performance semantic code search engine designed for the MediaWiki ecosystem. Built on the Jina 0.5b neural retrieval model, optimized for large-scale codebases like MediaWiki Core, Extensions, and WMF Operations. Metadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).
- Global MediaWiki Indexing: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
- Single-Stage Neural Retrieval: Uses `jina-code-embeddings-0.5b` with FAISS `IndexIVFPQ` for lightning-fast results (approx. 0.3s).
- Granular Structural Filtering: High-precision extraction and filtering of Functions, Types, Template Functions, and Template Types across 10 languages.
- Split-Build Architecture: Optimized for asymmetric hardware; run heavy extraction on a laptop and neural vectorization on a GPU.
- Massive Localization Footprint: Fully localized UI supporting 17 languages.
- Octahedron Vortex UI: A visually stunning frontend built with React and Three.js.
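To give a feel for why `IndexIVFPQ` is fast: an inverted-file (IVF) index partitions the vectors around coarse centroids, so a query only scans the few nearest partitions instead of every vector. The sketch below shows just that coarse-partitioning idea in plain Python on toy 2-D data; real IVF-PQ additionally compresses vectors with product quantization, and none of these names come from the project's actual code.

```python
# Minimal sketch of the inverted-file (IVF) idea behind FAISS's IndexIVFPQ.
# Only the coarse partitioning is shown; the PQ compression step is omitted.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid (one 'inverted list' each)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def search_ivf(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe closest inverted lists instead of all vectors."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# Toy 2-D "embeddings": two clusters around (0, 0) and (10, 10).
vecs = [(0.1, 0.0), (0.0, 0.2), (9.9, 10.1), (10.2, 9.8)]
cents = [(0.0, 0.0), (10.0, 10.0)]
inv = build_ivf(vecs, cents)
print(search_ivf((9.5, 10.0), vecs, cents, inv, nprobe=1, k=1))  # -> [2]
```

With `nprobe=1`, the query near (10, 10) never touches the two vectors in the other cluster, which is where the sub-linear search time comes from.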
The indexing pipeline is designed for a mass-scale, distributed build.
Create and activate a virtual environment (optional but recommended), then install dependencies:
```
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
pip install -r requirements.txt
```

To compile the React application, you will need Node.js installed on your machine. Install the dependencies and run the build:

```
npm install  # This also runs 'npm run build' automatically
```

This generates the pre-compiled frontend/js/app.js used by the application.
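A common way for `npm install` to trigger the build automatically is a `postinstall` script in package.json. The snippet below is only a guess at the shape of that hook; the project's actual package.json (and its build tool) may differ:

```json
{
  "scripts": {
    "build": "node build.js",
    "postinstall": "npm run build"
  }
}
```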
First, discover the ecosystem and mirror it for processing:
```
cd preprocessing
python list_repos.py      # Fetches 2,400+ repo URLs
python download_repos.py  # Shallow clones (approx. 8GB disk space)
```

Ensure all repositories are archived in Software Heritage for on-demand retrieval.
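`list_repos.py` presumably enumerates repositories from the Wikimedia Gerrit instance. One detail worth knowing if you adapt it: Gerrit's REST API prefixes every JSON response with the magic string `)]}'` to prevent XSSI, so the payload must be stripped before parsing. The sketch below shows that parsing step on a canned response; the function name is illustrative, not the script's actual API.

```python
# Hedged sketch: parsing a Gerrit /projects/ listing. Gerrit prepends the
# XSSI guard )]}' to JSON responses, which must be removed before json.loads.
import json

GERRIT_MAGIC = ")]}'"

def parse_gerrit_projects(raw: str) -> list[str]:
    """Strip the XSSI prefix and return the project names from /projects/."""
    if raw.startswith(GERRIT_MAGIC):
        raw = raw[len(GERRIT_MAGIC):]
    return sorted(json.loads(raw).keys())

# Canned response in the shape Gerrit returns (keys are project names):
sample = ")]}'\n" + '{"mediawiki/core": {"id": "x"}, "mediawiki/extensions/Echo": {"id": "y"}}'
print(parse_gerrit_projects(sample))  # -> ['mediawiki/core', 'mediawiki/extensions/Echo']
```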
Note
archive_to_swh.py requires a "bulk_save" token. For most users, it is recommended to use:
```
python archive_individual_to_swh.py
```

Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., Class::Method) and handles complex language features.
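The project's extraction is built on Tree-sitter across 10 languages; as a single-language illustration of what "qualified name" extraction means, here is a sketch using only Python's stdlib `ast` module. The function name and output format are hypothetical, not the project's actual code.

```python
# Illustration of qualified-name extraction (Class::method) for Python
# sources, using the stdlib ast module instead of Tree-sitter.
import ast

def qualified_functions(source: str) -> list[str]:
    """Return 'Class::method' (or bare function) names found in the source."""
    names = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                names.append("::".join(scope + [child.name]) if scope else child.name)
                visit(child, scope + [child.name])  # capture nested defs too
            elif isinstance(child, ast.ClassDef):
                visit(child, scope + [child.name])
            else:
                visit(child, scope)

    visit(ast.parse(source), [])
    return names

src = """
class Widget:
    def render(self): ...

def main(): ...
"""
print(qualified_functions(src))  # -> ['Widget::render', 'main']
```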
Phase 3a: Structural Extraction
```
python extract_structural_entities.py
```

Phase 3b: Identity Resolution

Resolve Git-compatible hashes to standard SHA-1. You can do this either locally (fast) or via the Software Heritage API (official):

- Option A: Local Resolution (Recommended)

  ```
  python resolve_swh_hashes_local.py
  ```

- Option B: API-based Resolution

  ```
  python resolve_swh_hashes.py
  ```
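Local resolution is possible because Git's blob hash is deterministic: it is the SHA-1 of the header `blob <size>\0` followed by the file contents, and Software Heritage's `sha1_git` content identifiers use the same construction. The sketch below computes it offline; the function name is illustrative of what `resolve_swh_hashes_local.py` presumably does, not its actual interface.

```python
# Compute a Git-compatible blob SHA-1 locally, with no network access.
# Git (and SWH's sha1_git) hash a blob as sha1(b"blob <size>\0" + data).
import hashlib

def git_blob_sha1(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known hash of the empty Git blob:
print(git_blob_sha1(b""))  # -> e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```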
Move raw_functions.json to a GPU-equipped environment to compute neural vectors and build the FAISS index.
```
cd backend
python generate_index.py  # Auto-detects CUDA/GPU
```

Before deploying, convert the production metadata to SQLite to stay within the 6GiB RAM limit:

```
cd backend
python migrate_to_sqlite.py
```

Once the index and database are ready, start the FastAPI backend from the root directory:

```
# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000
```

The server will be available at http://localhost:8000. You can access the automatic API documentation at http://localhost:8000/docs.
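For intuition on the SQLite migration step: moving metadata from an in-memory JSON structure into an indexed SQLite table lets the server look up records on demand instead of holding everything in RAM. The sketch below shows the general shape of such a migration with the stdlib `sqlite3` module; the schema and column names are assumptions, not the project's actual layout.

```python
# Hedged sketch of a JSON-to-SQLite metadata migration with an index on the
# lookup column. The real script writes functions.db; we use :memory: here.
import json
import sqlite3

records = json.loads('[{"id": 0, "name": "Parser::parse", "repo": "mediawiki/core"}]')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (id INTEGER PRIMARY KEY, name TEXT, repo TEXT)")
conn.execute("CREATE INDEX idx_functions_name ON functions(name)")
conn.executemany(
    "INSERT INTO functions (id, name, repo) VALUES (:id, :name, :repo)", records
)

row = conn.execute(
    "SELECT repo FROM functions WHERE name = ?", ("Parser::parse",)
).fetchone()
print(row[0])  # -> mediawiki/core
```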
Follow these steps to deploy the application on Wikimedia Toolforge.
Note
The examples below use supnabla as the username and code2codesearch as the project name. Replace these with your own Toolforge credentials where applicable.
Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:
# From the project root
scp -r "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -r "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -r "./backend/functions.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/Log into Toolforge and set the necessary permissions:
```
ssh supnabla@login.toolforge.org
chmod -R a+r /data/project/code2codesearch/mediawiki-code2code-search/models/
chmod a+r /data/project/code2codesearch/backend/functions.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
```

Now you are ready to deploy the webservice:
```
# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f
```

- Neural Model: Jina Code Embeddings (0.5b)
- Vector Engine: FAISS (IndexIVFPQ for memory efficiency)
- Segmentation: Tree-sitter
- Archive Access: Software Heritage
- Frontend: React 18 / Three.js
Apache 2.0 License. Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.