Skip to content

dhimanharman866-bit/Code-mind

Repository files navigation

RAG Code Analyzer

An intelligent codebase analysis system that leverages Retrieval-Augmented Generation (RAG) to understand, debug, and interact with large codebases using natural language.

This project bridges the gap between static code and dynamic reasoning by combining semantic search with large language models.


Overview

Modern codebases are large, fragmented, and difficult to reason about. Traditional debugging tools rely heavily on manual inspection and lack contextual understanding.

RAG Code Analyzer introduces an AI-assisted workflow where:

  • Code is transformed into semantically meaningful chunks
  • Embedded into a vector database
  • Retrieved contextually based on user queries
  • Interpreted using an LLM to generate precise, contextual answers

Key Features

  • Context-aware code understanding
  • Semantic search across entire codebases
  • Intelligent debugging assistance
  • Modular RAG pipeline (loader → chunker → embeddings → retriever → QA)
  • Streamlit interface for interactive querying
  • File change detection via hashing (optional optimization)

Architecture

The system follows a modular pipeline:

  1. Loader Responsible for ingesting raw code files

  2. Chunker Splits code into meaningful segments

  3. Embeddings Converts code chunks into vector representations

  4. Vector Store Stores embeddings for efficient similarity search

  5. Retriever Fetches the most relevant chunks based on query

  6. QA Chain Generates contextual responses using LLM


Tech Stack

  • Python
  • LangChain
  • FAISS (vector similarity search)
  • Streamlit
  • Groq LLM API

Project Structure

.
├── app.py              # Streamlit UI
├── main.py             # Entry point
├── loader.py           # Code ingestion
├── chunker.py          # Code splitting
├── embeddings.py       # Embedding generation
├── vectorstore.py      # FAISS integration
├── retriever.py        # Context retrieval
├── qa_chain.py         # LLM interaction
├── config.py           # Configuration
├── requirements.txt
└── .gitignore

How It Works

  1. The system loads and processes a codebase
  2. Files are split into semantically meaningful chunks
  3. Each chunk is converted into embeddings
  4. User query is embedded and matched against stored vectors
  5. Relevant code snippets are retrieved
  6. LLM generates a contextual answer based on retrieved data

Installation

git clone https://github.com/your-username/rag-code-analyzer.git
cd rag-code-analyzer
pip install -r requirements.txt

Usage

streamlit run app.py

Then open the interface in your browser and start querying your codebase.


Example Use Cases

  • "Explain the flow of this module"
  • "Where is this function defined?"
  • "Why might this error be occurring?"
  • "Summarize the logic of this file"

Design Philosophy

This project is built on three principles:

  1. Modularity Each component is independently extensible

  2. Explainability Retrieval ensures responses are grounded in actual code

  3. Practicality Designed for real-world debugging, not just experimentation


Limitations

  • Performance depends on embedding quality
  • Large repositories may require optimization
  • Requires API access for LLM responses

Future Improvements

  • Multi-language code support
  • AST-based chunking
  • Code graph integration
  • Incremental indexing
  • Deployment-ready API layer

Author

Harmanpreet Dhiman AI + Software Engineering Enthusiast


License

This project is open-source and available under the MIT License.

About

CodeMind — AI-powered codebase analysis using RAG, LangChain, FAISS and Groq LLM. Ask questions about your code in plain English.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages