Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…to feature/python/rag-agent
Pull request overview
This PR implements a comprehensive RAG (Retrieval-Augmented Generation) agent for the IOC.EAssistant chatbot, transforming it from a simple echo service into a fully functional AI-powered assistant. The implementation supports dual model providers (OpenAI and Ollama), integrates document vectorization with ChromaDB, and provides a stateless REST API that interfaces with a .NET backend for conversation history management.
Key changes:
- RAG Implementation: Added complete RAG pipeline with LangChain agent, tool calling, MMR-based retrieval, and web search fallback
- Dual Provider Support: Flexible architecture supporting both OpenAI (cloud) and Ollama (local) model providers
- Document Processing: Implemented crawler enhancements and vectorization pipeline with metadata extraction and ChromaDB storage
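As a note on the MMR-based retrieval mentioned above: maximal marginal relevance picks documents that are relevant to the query yet dissimilar to documents already selected. A minimal, self-contained sketch of the selection step (illustrative only — the PR presumably delegates this to the vector store's retriever):

```python
def mmr_select(query_sim, doc_sims, k, lambda_mult=0.5):
    """Pick k document indices by maximal marginal relevance.

    query_sim[i]  : similarity of document i to the query.
    doc_sims[i][j]: similarity between documents i and j.
    lambda_mult   : trade-off between relevance (1.0) and diversity (0.0).
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates similar to anything already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate documents and one distinct document, MMR skips the duplicate in favor of the distinct one even if the duplicate scores slightly higher on relevance.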
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| python/web.py | Deleted placeholder Flask app with echo functionality |
| python/app.py | New Flask app with RAG agent integration, conversation history support, and OpenAI-compatible API format |
| python/rag_agent.py | Core RAG agent implementation with LangChain, dual provider support, MMR retrieval, and web search tools |
| python/vectorize_documents.py | Document vectorization pipeline with metadata extraction, chunking, and ChromaDB persistence |
| python/utils.py | Utility functions for GPU configuration and document formatting |
| python/crawler.py | Enhanced to assign document types ('general' vs 'noticia') for filtered retrieval |
| python/requirements.txt | Added LangChain ecosystem dependencies and updated existing packages |
| python/README.md | Comprehensive documentation covering setup, providers, API usage, and architecture |
| python/.gitignore | Added chroma_db/ to ignored directories |
| python/.env.example | Added configuration for model providers, embeddings, LLM models, and database settings |
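Because the Python API is stateless and the .NET backend owns conversation history, each request has to carry the full history in OpenAI chat format. A hypothetical helper showing that request shaping (function and field names are illustrative, not from the PR's app.py):

```python
def build_messages(history, question, system_prompt="You are a helpful assistant."):
    """Assemble an OpenAI-style messages list from externally stored history.

    history : list of {"role": ..., "content": ...} dicts sent by the backend.
    question: the new user turn to append.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for turn in history:
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": question})
    return messages
```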
python/app.py
Outdated
```python
subprocess.run(
    [sys.executable, "crawler.py"],
    cwd=os.path.dirname(os.path.abspath(__file__)),
    capture_output=True,
    text=True,
    timeout=600  # 10 minutes timeout
)

# If ChromaDB doesn't exist, run vectorize_documents
if not chroma_db_exists:
    print("vectorizing documents...")
    subprocess.run(
        [sys.executable, "vectorize_documents.py"],
        cwd=os.path.dirname(os.path.abspath(__file__)),
        capture_output=True,
        text=True,
        timeout=600  # 10 minutes timeout
```
The subprocess call to crawler.py doesn't validate or sanitize environment variables that could affect its execution. If an attacker can control environment variables (e.g., through Docker), they could potentially inject malicious commands. Consider validating inputs or using more secure subprocess invocation methods.
Suggested change:

```diff
-subprocess.run(
-    [sys.executable, "crawler.py"],
-    cwd=os.path.dirname(os.path.abspath(__file__)),
-    capture_output=True,
-    text=True,
-    timeout=600  # 10 minutes timeout
-)
-# If ChromaDB doesn't exist, run vectorize_documents
-if not chroma_db_exists:
-    print("vectorizing documents...")
-    subprocess.run(
-        [sys.executable, "vectorize_documents.py"],
-        cwd=os.path.dirname(os.path.abspath(__file__)),
-        capture_output=True,
-        text=True,
-        timeout=600  # 10 minutes timeout
+# Sanitize environment for subprocess
+safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
+safe_env["PATH"] = "/usr/bin:/bin"  # Set a safe default PATH
+subprocess.run(
+    [sys.executable, "crawler.py"],
+    cwd=os.path.dirname(os.path.abspath(__file__)),
+    capture_output=True,
+    text=True,
+    timeout=600,  # 10 minutes timeout
+    env=safe_env
+)
+# If ChromaDB doesn't exist, run vectorize_documents
+if not chroma_db_exists:
+    print("vectorizing documents...")
+    # Sanitize environment for subprocess
+    safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
+    safe_env["PATH"] = "/usr/bin:/bin"  # Set a safe default PATH
+    subprocess.run(
+        [sys.executable, "vectorize_documents.py"],
+        cwd=os.path.dirname(os.path.abspath(__file__)),
+        capture_output=True,
+        text=True,
+        timeout=600,  # 10 minutes timeout
+        env=safe_env
```
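Distilled from the suggestion above, the sanitization step could live in a small helper so both subprocess calls share it (a sketch; the hardcoded fallback PATH would need adjusting per deployment):

```python
import os

def safe_subprocess_env():
    """Build a sanitized environment for child processes: drop inherited
    PYTHONPATH/PATH and pin a minimal, known-good PATH."""
    safe_env = {k: v for k, v in os.environ.items() if k not in ("PYTHONPATH", "PATH")}
    safe_env["PATH"] = "/usr/bin:/bin"  # Safe default PATH
    return safe_env
```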
python/app.py
Outdated
```python
subprocess.run(
    [sys.executable, "crawler.py"],
    cwd=os.path.dirname(os.path.abspath(__file__)),
    capture_output=True,
    text=True,
    timeout=600  # 10 minutes timeout
)
```
Running the crawler synchronously during application startup with a 10-minute timeout can cause deployment delays and potential startup timeouts. Consider running this as a separate background job or scheduled task instead of blocking application startup.
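One way to implement this suggestion is to launch the script in a daemon thread so the Flask app can start serving immediately (a sketch with an illustrative helper name; a real deployment might prefer a scheduled job or a separate container):

```python
import subprocess
import sys
import threading

def run_script_in_background(script_args, timeout=600):
    """Run a subprocess in a daemon thread so it doesn't block app startup.

    Returns the thread and a dict that is filled in once the process exits.
    """
    result = {}

    def _worker():
        proc = subprocess.run(script_args, capture_output=True, text=True, timeout=timeout)
        result["returncode"] = proc.returncode
        result["stdout"] = proc.stdout

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread, result

# Usage (crawler.py is the script from this PR):
# thread, result = run_script_in_background([sys.executable, "crawler.py"])
```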
@copilot open a new pull request to apply changes based on the comments in this thread

@xujiongze I've opened a new pull request, #95, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #96, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #97, to work on those changes. Once the pull request is ready, I'll request review from you.

@xujiongze I've opened a new pull request, #98, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #99, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 24 comments.
Comments suppressed due to low confidence (1)
python/app.py:7
- Import of 'asyncio' is not used.
`from datetime import datetime`
```python
print(f" Date: {sample.metadata.get('date', 'N/A')}")
print(f" Content preview: {sample.page_content[:150]}...\n")

print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
```
[nitpick] The comment on line 196 says "size={chunk_size}, overlap={chunk_overlap}" but the actual values used are hardcoded to 700 and 120 (lines 200-201), not the parameters. Update the comment to reflect the actual values or fix the code to use the parameters.
Suggested change:

```diff
-print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
+print("Splitting documents into chunks (size=700, overlap=120)...")
```
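The fix the comment says it prefers — threading the parameters through rather than hardcoding 700/120 — could look like this simplified character-based splitter (a sketch; the PR's pipeline presumably uses a LangChain text splitter rather than this hand-rolled loop):

```python
def split_documents(texts, chunk_size=700, chunk_overlap=120):
    """Split each text into overlapping chunks, honoring the parameters
    instead of hardcoded values."""
    print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
    chunks = []
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    for text in texts:
        for start in range(0, max(len(text), 1), step):
            chunks.append(text[start:start + chunk_size])
    return chunks
```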
```python
self.agent = create_agent(self.llm, self.tools, system_prompt=system_prompt)
self.use_agent = True
print("Using agent mode with tool calling")
except NotImplementedError:
    print("Agent mode not supported, langchain version may be outdated.")
    self.use_agent = False
```
The agent initialization always sets self.use_agent = True on line 263, but this variable is never used anywhere in the code. The code then proceeds to call self.agent.invoke() without checking if the agent was successfully created. If create_agent raises NotImplementedError, the exception is caught and self.use_agent is set to False, but in the query and query_with_history methods, the code still attempts to invoke self.agent without checking this flag, which will cause an AttributeError.
Add a check in the query methods:

```python
if not self.use_agent or not hasattr(self, 'agent'):
    # Use fallback directly
    return self._fallback_rag_search(question)
try:
    response = self.agent.invoke({"messages": messages})
    ...
```
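Fleshing out the suggested guard, the pattern might look like this standalone sketch (class and method names mirror the PR's RAG agent, but this is an illustration, not the PR's actual class):

```python
class AgentWithFallback:
    """Only invoke the agent when it was created successfully;
    otherwise fall back to plain RAG search."""

    def __init__(self, agent_factory):
        self.use_agent = False
        try:
            self.agent = agent_factory()
            self.use_agent = True
        except NotImplementedError:
            # Agent mode unsupported; queries will use the fallback path.
            self.use_agent = False

    def query(self, question):
        if not self.use_agent or not hasattr(self, "agent"):
            return self._fallback_rag_search(question)
        return self.agent.invoke({"messages": [question]})

    def _fallback_rag_search(self, question):
        # Stand-in for the PR's retrieval-only fallback.
        return f"fallback answer for: {question}"
```

The key point is that the `use_agent` flag is actually consulted before `self.agent.invoke()` is called, so a failed `create_agent` can no longer cause an AttributeError at query time.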
@xujiongze I've opened a new pull request, #100, to work on those changes. Once the pull request is ready, I'll request review from you.

@xujiongze I've opened a new pull request, #101, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Updated!