196 changes: 196 additions & 0 deletions ai-code-search/README.md
# 🔎 AI Semantic Code Search using Endee

## 📌 Overview

This project implements an **AI-powered semantic code search system** built on top of the **Endee Vector Database**.

Traditional code search relies on keyword matching, which often fails to retrieve relevant results when the exact keywords are not present. This project solves that problem by using **vector embeddings and semantic similarity search**.

The system indexes source code from GitHub repositories, converts the code into vector embeddings using **Sentence Transformers**, and stores those vectors in the **Endee vector database**. Users can then search for relevant code snippets using **natural language queries**.

---

## 🧠 How It Works

The system performs semantic search using the following pipeline:

```
GitHub Repository
        ↓
Repository Loader
        ↓
Code Parser
        ↓
Code Chunking
        ↓
Embedding Model (Sentence Transformers)
        ↓
Endee Vector Database
        ↓
Semantic Code Search
```
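The stages above can be sketched end to end in plain Python. The toy `embed()` below is an illustrative stand-in for Sentence Transformers, and the in-memory dict stands in for Endee; only the shape of the flow matches the real pipeline, not its components.

```python
# Minimal sketch of the pipeline: chunk -> embed -> index -> search.
# embed() is a toy bag-of-words stand-in for model.encode(); the
# `index` dict is a stand-in for the Endee vector database.
import math
from collections import Counter


def embed(text):
    # Count occurrences of a small fixed vocabulary (toy embedding).
    vocab = ["auth", "token", "jwt", "cookie", "parse", "request"]
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# "Index" each code chunk by storing its vector under its file path.
chunks = [
    ("auth.py", "def jwt auth token check"),
    ("cookies.py", "def parse cookie request"),
]
index = {path: embed(code) for path, code in chunks}

# Semantic search: embed the query, rank chunks by cosine similarity.
query_vec = embed("jwt authentication token")
results = sorted(index, key=lambda p: cosine(index[p], query_vec), reverse=True)
print(results[0])  # the most semantically similar file
```

The real system replaces `embed()` with a 384-dimensional Sentence Transformers model and the dict with Endee's cosine-space index, but the ranking logic is the same.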


---

## 📂 Project Structure

```
ai-code-search/
├── core/
│   ├── embedder.py        # Generates vector embeddings
│   ├── indexer.py         # Inserts embeddings into Endee
│   └── searcher.py        # Performs semantic vector search
├── scripts/
│   ├── repo_loader.py     # Clones GitHub repositories
│   └── code_parser.py     # Parses and chunks source code
├── data/
│   └── repos/             # Downloaded repositories
├── screenshots/
│   ├── endee_dashboard.png
│   ├── indexing_process.png
│   └── search_results.png
└── README.md
```


---

## ✨ Features

- Clone GitHub repositories automatically
- Extract and parse source code files
- Split code into chunks for indexing
- Generate embeddings using **Sentence Transformers**
- Store vectors in **Endee Vector Database**
- Perform **semantic search using natural language queries**
- Retrieve relevant code files based on meaning instead of keywords

---

## 🔍 Example Search

Example user query: `jwt authentication`


### Example Output

```
Result 1
Score: 0.286
File: data/repos/requests/src/requests/auth.py

Result 2
Score: 0.285
File: data/repos/requests/src/requests/cookies.py
```


The system identifies relevant code files related to authentication even if the query wording differs.

---

## 🖼 Screenshots

### 1️⃣ Endee Vector Index Dashboard

Shows the indexed vectors stored in Endee.
![Endee vector index dashboard](image.png)

---

### 2️⃣ Indexing Process

Terminal output while indexing a repository.
![Terminal output during indexing](image-1.png)


---

### 3️⃣ Semantic Search Results

Search results returned for a natural language query.
![Semantic search results](image-2.png)

---

## ⚙️ Setup Instructions

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/GitNinja4/endee
cd ai-code-search
```


---

### 2️⃣ Install Dependencies

```bash
pip install sentence-transformers requests msgpack
```


---

### 3️⃣ Start Endee Vector Database

Run the following command from the root directory:

```bash
docker compose up -d
```


Then open the Endee dashboard at http://localhost:8080.


---

### 4️⃣ Index a Repository

Run the indexing pipeline:

```bash
python -m core.indexer
```


This step:

- parses source code
- generates embeddings
- stores vectors in Endee
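The parsing and chunking step lives in `scripts/code_parser.py`, which is not shown in this diff. A minimal sketch of what `parse_repository` might look like, assuming fixed-size line windows and the `file`/`code` chunk keys that `core/indexer.py` consumes (the chunk size and `.py`-only filter are illustrative assumptions):

```python
# Hypothetical sketch of scripts/code_parser.py: walk a repository,
# read Python files, and split each into fixed-size line windows.
import os

CHUNK_LINES = 30  # assumed chunk size, not confirmed by the project


def parse_repository(repo_path):
    chunks = []
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                lines = f.readlines()
            # One chunk per CHUNK_LINES-line window, tagged with its file
            # so search results can point back to the source location.
            for start in range(0, len(lines), CHUNK_LINES):
                chunks.append({
                    "file": path,
                    "code": "".join(lines[start:start + CHUNK_LINES]),
                })
    return chunks
```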

---

### 5️⃣ Run Semantic Search

Run the search engine:

```bash
python -m core.searcher
```


Example query: `jwt authentication`


The system returns the most relevant code files based on semantic similarity.
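`core/searcher.py` itself is empty in this diff. A hedged sketch of what it might contain: the `/search` endpoint path, the payload fields, and the response shape below are assumptions modeled on the insert endpoint used in `core/indexer.py`, not a confirmed Endee API.

```python
# Hypothetical sketch of core/searcher.py. Endpoint path, payload
# fields, and response shape are assumptions, not confirmed Endee API.
import requests

ENDEE_URL = "http://localhost:8080"
INDEX_NAME = "code_index"


def build_search_payload(query_vector, k=5):
    # Ask for the top-k nearest vectors to the query embedding.
    return {"vector": query_vector, "k": k}


def search(query, embedder, k=5):
    # `embedder` would be e.g. core.embedder.CodeEmbedder().
    query_vector = embedder.embed_text(query)
    url = f"{ENDEE_URL}/api/v1/index/{INDEX_NAME}/search"  # assumed path
    response = requests.post(url, json=build_search_payload(query_vector, k))
    response.raise_for_status()
    # Assumed response shape: a list of {"id", "score", "meta"} objects.
    for rank, hit in enumerate(response.json(), start=1):
        print(f"Result {rank}\nScore: {hit['score']:.3f}\nFile: {hit['meta']}\n")
```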

---

## 🛠 Technologies Used

- Python
- Sentence Transformers
- Endee Vector Database
- Docker
- Git
- Vector Similarity Search

---

## 🚀 Future Improvements

Possible extensions for this project include:

- returning actual **code snippets** instead of just file paths
- adding a **FastAPI backend for search API**
- building a **web interface for code search**
- indexing multiple repositories automatically
- ranking results using additional metadata

---

## 📄 License

This project is developed for experimentation and learning purposes using vector databases and semantic search.
Empty file added ai-code-search/api/app.py
Empty file added ai-code-search/core/__init__.py
49 changes: 49 additions & 0 deletions ai-code-search/core/embedder.py
"""
embedder.py

This module generates vector embeddings for code snippets
using Sentence Transformers.
"""

from sentence_transformers import SentenceTransformer


class CodeEmbedder:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        """
        Initialize the embedding model.
        """
        print("Loading embedding model...")
        self.model = SentenceTransformer(model_name)
        print("Model loaded successfully.")

    def embed_text(self, text: str):
        """
        Convert a single text/code snippet into a vector embedding.
        """
        embedding = self.model.encode(text)
        return embedding.tolist()

    def embed_batch(self, texts):
        """
        Convert multiple texts into embeddings.
        """
        embeddings = self.model.encode(texts)
        return [e.tolist() for e in embeddings]


# Quick test
if __name__ == "__main__":
    embedder = CodeEmbedder()

    sample_code = """
def create_access_token(data):
    payload = jwt.encode(data, SECRET_KEY)
    return payload
"""

    vector = embedder.embed_text(sample_code)

    print("Vector length:", len(vector))
    print("First 10 values:", vector[:10])
109 changes: 109 additions & 0 deletions ai-code-search/core/indexer.py
"""
indexer.py

This module takes parsed code chunks,
generates embeddings, and inserts them
into the Endee vector database.
"""

import requests

from core.embedder import CodeEmbedder
from scripts.code_parser import parse_repository


# Endee server configuration
ENDEE_URL = "http://localhost:8080"
INDEX_NAME = "code_index"
VECTOR_DIM = 384


def create_index():
    """
    Create a vector index in Endee.
    """
    print("Creating index...")

    url = f"{ENDEE_URL}/api/v1/index/create"

    payload = {
        "index_name": INDEX_NAME,
        "dim": VECTOR_DIM,
        "space_type": "cosine"
    }

    response = requests.post(url, json=payload)

    if response.status_code == 200:
        print("Index created successfully.")
    else:
        print("Index creation response:", response.text)


def insert_vector(vector_id, vector, metadata):
    """
    Insert a vector into Endee.
    Endee expects a list of objects.
    """
    url = f"{ENDEE_URL}/api/v1/index/{INDEX_NAME}/vector/insert"

    payload = [
        {
            "id": vector_id,
            "vector": vector,
            "meta": metadata
        }
    ]

    response = requests.post(url, json=payload)

    if response.status_code == 200:
        print(f"Inserted {vector_id}")
    else:
        print("Insert failed:", response.status_code, response.text)


def index_repository(repo_path):
    """
    Parse the repository, generate embeddings,
    and store them in Endee.
    """
    print("\nParsing repository...")

    chunks = parse_repository(repo_path)

    embedder = CodeEmbedder()

    print("\nIndexing code chunks...")

    for i, chunk in enumerate(chunks):
        code = chunk["code"]
        file_path = chunk["file"]

        # Generate embedding
        embedding = embedder.embed_text(code)

        vector_id = f"chunk_{i}"

        insert_vector(
            vector_id,
            embedding,
            file_path
        )

        if i % 10 == 0:
            print(f"Indexed {i} chunks")

    print("\nIndexing complete.")


if __name__ == "__main__":
    repo_path = "data/repos/requests"

    create_index()
    index_repository(repo_path)