An AI-powered research assistant that scrapes real estate and financial pages, indexes them into a local vector store, and answers context-aware questions using Groq's Llama models.
- URL scraping: Extracts text from web pages using `unstructured`.
- Vector storage: Uses ChromaDB for local embeddings and persistence.
- Embeddings: Uses the `thenlper/gte-base` embedding model for semantic search.
- LLM answering: Uses Groq/ChatGroq (Llama) for responses with contextual retrieval.
- Streamlit frontend: Simple UI to process URLs and chat with indexed documents.
- Create and activate a virtual environment (recommended):
python -m venv .venv
On Windows:
.venv\Scripts\activate
On macOS / Linux:
source .venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Or install the full stack:
pip install streamlit langchain langchain-community langchain-huggingface langchain-chroma langchain-groq unstructured python-dotenv sentence-transformers
Note (Windows): if unstructured reports file-type or libmagic issues, run:
pip install python-magic-bin
- Create a `.env` file in the project root with your Groq API key:
GROQ_API_KEY=your_groq_api_key_here
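In the app itself, `python-dotenv`'s `load_dotenv()` reads this file for you. If you want to see what that step amounts to, here is a minimal stdlib-only sketch (the `load_env` helper is hypothetical, for illustration only):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: reads KEY=VALUE lines into os.environ.
    python-dotenv's load_dotenv() does this (and more) in the real app."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        # Skip blanks and comments; split on the first '=' only
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After loading, `os.getenv("GROQ_API_KEY")` should return your key; failing early with a clear error when it is missing saves a confusing LLM call later.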
The expected layout for the minimal app:
Inside `real_state_tool/`:
- `main.py` or `app.py` — Streamlit frontend
- `rag.py` — RAG processing & indexing code
- `requirements.txt`
- `resources/` — ChromaDB persistence (created at runtime)
Run the Streamlit app:
streamlit run app.py
Workflow:
- Enter up to 3 URLs in the sidebar.
- Click Process URLs to scrape, split, embed, and index content.
- Ask questions in the chat — the app retrieves relevant context and uses the LLM to answer.
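Conceptually, the split/embed/retrieve part of this workflow behaves like the toy sketch below. This is a stdlib-only illustration with a bag-of-words stand-in for embeddings, not the app's actual code (which uses a LangChain text splitter, `thenlper/gte-base` embeddings, and ChromaDB); all function names here are made up for the example:

```python
import math
import re
from collections import Counter

def split_text(text, chunk_size=200, overlap=50):
    """Rough stand-in for a character splitter: fixed-size chunks with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words 'embedding'; the real app uses a sentence-transformer model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (the vector store's job)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the real app, the retrieved chunks are then passed to the Groq LLM as context for the answer; here they would simply be returned to the caller.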
Add these to requirements.txt (example):
streamlit
langchain
langchain-community
langchain-huggingface
langchain-chroma
langchain-groq
unstructured
python-dotenv
sentence-transformers
- First run may download the `gte-base` embedding model (~200MB).
- Indexed data persists under `resources/`; you can reuse it instead of re-processing URLs.
- If you need robust HTML scraping, consider adding `beautifulsoup4` or `trafilatura` as optional helpers.
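To take advantage of the persisted index, the app can skip re-processing when `resources/` already contains data. A small sketch of that check (the `needs_indexing` helper is hypothetical; with LangChain you would point the Chroma vector store at the same directory via `persist_directory="resources"`):

```python
from pathlib import Path

def needs_indexing(persist_dir="resources"):
    """Return True when the ChromaDB persistence directory is missing or empty,
    i.e. URLs still need to be processed; otherwise the existing index can be
    reloaded instead of re-scraping and re-embedding."""
    p = Path(persist_dir)
    return not (p.is_dir() and any(p.iterdir()))
```

Guarding the "Process URLs" step with a check like this keeps startup fast on repeat runs while still allowing a fresh index after you delete `resources/`.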