Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…to feature/python/rag-agent
Pull request overview
This PR implements a comprehensive RAG (Retrieval-Augmented Generation) agent for the IOC.EAssistant chatbot, transforming it from a simple echo service into a fully functional AI-powered assistant. The implementation supports dual model providers (OpenAI and Ollama), integrates document vectorization with ChromaDB, and provides a stateless REST API that interfaces with a .NET backend for conversation history management.
Key changes:
- RAG Implementation: Added complete RAG pipeline with LangChain agent, tool calling, MMR-based retrieval, and web search fallback
- Dual Provider Support: Flexible architecture supporting both OpenAI (cloud) and Ollama (local) model providers
- Document Processing: Implemented crawler enhancements and vectorization pipeline with metadata extraction and ChromaDB storage
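As a note on the MMR-based retrieval mentioned above: maximal marginal relevance picks documents that are relevant to the query yet dissimilar to documents already selected. A minimal, self-contained sketch of the selection step (illustrative only — the PR presumably delegates this to the vector store's retriever):

```python
def mmr_select(query_sim, doc_sims, k, lambda_mult=0.5):
    """Pick k document indices by maximal marginal relevance.

    query_sim[i]  : similarity of document i to the query.
    doc_sims[i][j]: similarity between documents i and j.
    lambda_mult   : trade-off between relevance (1.0) and diversity (0.0).
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates similar to anything already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate documents and one distinct document, MMR skips the duplicate in favor of the distinct one even if the duplicate scores slightly higher on relevance.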
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| python/web.py | Deleted placeholder Flask app with echo functionality |
| python/app.py | New Flask app with RAG agent integration, conversation history support, and OpenAI-compatible API format |
| python/rag_agent.py | Core RAG agent implementation with LangChain, dual provider support, MMR retrieval, and web search tools |
| python/vectorize_documents.py | Document vectorization pipeline with metadata extraction, chunking, and ChromaDB persistence |
| python/utils.py | Utility functions for GPU configuration and document formatting |
| python/crawler.py | Enhanced to assign document types ('general' vs 'noticia') for filtered retrieval |
| python/requirements.txt | Added LangChain ecosystem dependencies and updated existing packages |
| python/README.md | Comprehensive documentation covering setup, providers, API usage, and architecture |
| python/.gitignore | Added chroma_db/ to ignored directories |
| python/.env.example | Added configuration for model providers, embeddings, LLM models, and database settings |
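Because the Python API is stateless and the .NET backend owns conversation history, each request has to carry the full history in OpenAI chat format. A hypothetical helper showing that request shaping (function and field names are illustrative, not from the PR's app.py):

```python
def build_messages(history, question, system_prompt="You are a helpful assistant."):
    """Assemble an OpenAI-style messages list from externally stored history.

    history : list of {"role": ..., "content": ...} dicts sent by the backend.
    question: the new user turn to append.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for turn in history:
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": question})
    return messages
```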
python/app.py
Outdated
```python
subprocess.run(
    [sys.executable, "crawler.py"],
    cwd=os.path.dirname(os.path.abspath(__file__)),
    capture_output=True,
    text=True,
    timeout=600  # 10 minutes timeout
)

# If ChromaDB doesn't exist, run vectorize_documents
if not chroma_db_exists:
    print("vectorizing documents...")
    subprocess.run(
        [sys.executable, "vectorize_documents.py"],
        cwd=os.path.dirname(os.path.abspath(__file__)),
        capture_output=True,
        text=True,
        timeout=600  # 10 minutes timeout
```
The subprocess call to crawler.py doesn't validate or sanitize environment variables that could affect its execution. If an attacker can control environment variables (e.g., through Docker), they could potentially inject malicious commands. Consider validating inputs or using more secure subprocess invocation methods.
Suggested change:

```diff
-subprocess.run(
-    [sys.executable, "crawler.py"],
-    cwd=os.path.dirname(os.path.abspath(__file__)),
-    capture_output=True,
-    text=True,
-    timeout=600  # 10 minutes timeout
-)
-# If ChromaDB doesn't exist, run vectorize_documents
-if not chroma_db_exists:
-    print("vectorizing documents...")
-    subprocess.run(
-        [sys.executable, "vectorize_documents.py"],
-        cwd=os.path.dirname(os.path.abspath(__file__)),
-        capture_output=True,
-        text=True,
-        timeout=600  # 10 minutes timeout
+# Sanitize environment for subprocess
+safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
+safe_env["PATH"] = "/usr/bin:/bin"  # Set a safe default PATH
+subprocess.run(
+    [sys.executable, "crawler.py"],
+    cwd=os.path.dirname(os.path.abspath(__file__)),
+    capture_output=True,
+    text=True,
+    timeout=600,  # 10 minutes timeout
+    env=safe_env
+)
+# If ChromaDB doesn't exist, run vectorize_documents
+if not chroma_db_exists:
+    print("vectorizing documents...")
+    # Sanitize environment for subprocess
+    safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
+    safe_env["PATH"] = "/usr/bin:/bin"  # Set a safe default PATH
+    subprocess.run(
+        [sys.executable, "vectorize_documents.py"],
+        cwd=os.path.dirname(os.path.abspath(__file__)),
+        capture_output=True,
+        text=True,
+        timeout=600,  # 10 minutes timeout
+        env=safe_env
```
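Distilled from the suggestion above, the sanitization step could live in a small helper so both subprocess calls share it (a sketch; the hardcoded fallback PATH would need adjusting per deployment):

```python
import os

def safe_subprocess_env():
    """Build a sanitized environment for child processes: drop inherited
    PYTHONPATH/PATH and pin a minimal, known-good PATH."""
    safe_env = {k: v for k, v in os.environ.items() if k not in ("PYTHONPATH", "PATH")}
    safe_env["PATH"] = "/usr/bin:/bin"  # Safe default PATH
    return safe_env
```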
python/app.py
Outdated
```python
subprocess.run(
    [sys.executable, "crawler.py"],
    cwd=os.path.dirname(os.path.abspath(__file__)),
    capture_output=True,
    text=True,
    timeout=600  # 10 minutes timeout
)
```
Running the crawler synchronously during application startup with a 10-minute timeout can cause deployment delays and potential startup timeouts. Consider running this as a separate background job or scheduled task instead of blocking application startup.
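One way to implement this suggestion is to launch the script in a daemon thread so the Flask app can start serving immediately (a sketch with an illustrative helper name; a real deployment might prefer a scheduled job or a separate container):

```python
import subprocess
import sys
import threading

def run_script_in_background(script_args, timeout=600):
    """Run a subprocess in a daemon thread so it doesn't block app startup.

    Returns the thread and a dict that is filled in once the process exits.
    """
    result = {}

    def _worker():
        proc = subprocess.run(script_args, capture_output=True, text=True, timeout=timeout)
        result["returncode"] = proc.returncode
        result["stdout"] = proc.stdout

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread, result

# Usage (crawler.py is the script from this PR):
# thread, result = run_script_in_background([sys.executable, "crawler.py"])
```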
@copilot open a new pull request to apply changes based on the comments in this thread

@xujiongze I've opened a new pull request, #95, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #96, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #97, to work on those changes. Once the pull request is ready, I'll request review from you.

@xujiongze I've opened a new pull request, #98, to work on those changes. Once the pull request is ready, I'll request review from you.
@xujiongze I've opened a new pull request, #99, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 24 comments.
Comments suppressed due to low confidence (1)
python/app.py:7
- Import of 'asyncio' is not used.
`from datetime import datetime`
```python
print(f" Date: {sample.metadata.get('date', 'N/A')}")
print(f" Content preview: {sample.page_content[:150]}...\n")

print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
```
[nitpick] The comment on line 196 says "size={chunk_size}, overlap={chunk_overlap}" but the actual values used are hardcoded to 700 and 120 (lines 200-201), not the parameters. Update the comment to reflect the actual values or fix the code to use the parameters.
Suggested change:

```diff
-print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
+print("Splitting documents into chunks (size=700, overlap=120)...")
```
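The fix the comment says it prefers — threading the parameters through rather than hardcoding 700/120 — could look like this simplified character-based splitter (a sketch; the PR's pipeline presumably uses a LangChain text splitter rather than this hand-rolled loop):

```python
def split_documents(texts, chunk_size=700, chunk_overlap=120):
    """Split each text into overlapping chunks, honoring the parameters
    instead of hardcoded values."""
    print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
    chunks = []
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    for text in texts:
        for start in range(0, max(len(text), 1), step):
            chunks.append(text[start:start + chunk_size])
    return chunks
```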
```python
self.agent = create_agent(self.llm, self.tools, system_prompt=system_prompt)
self.use_agent = True
print("Using agent mode with tool calling")
except NotImplementedError:
    print("Agent mode not supported, langchain version may be outdated.")
    self.use_agent = False
```
The agent initialization always sets self.use_agent = True on line 263, but this variable is never used anywhere in the code. The code then proceeds to call self.agent.invoke() without checking if the agent was successfully created. If create_agent raises NotImplementedError, the exception is caught and self.use_agent is set to False, but in the query and query_with_history methods, the code still attempts to invoke self.agent without checking this flag, which will cause an AttributeError.
Add a check in the query methods:

```python
if not self.use_agent or not hasattr(self, 'agent'):
    # Use fallback directly
    return self._fallback_rag_search(question)
try:
    response = self.agent.invoke({"messages": messages})
    ...
```
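Fleshing out the suggested guard, the pattern might look like this standalone sketch (class and method names mirror the PR's RAG agent, but this is an illustration, not the PR's actual class):

```python
class AgentWithFallback:
    """Only invoke the agent when it was created successfully;
    otherwise fall back to plain RAG search."""

    def __init__(self, agent_factory):
        self.use_agent = False
        try:
            self.agent = agent_factory()
            self.use_agent = True
        except NotImplementedError:
            # Agent mode unsupported; queries will use the fallback path.
            self.use_agent = False

    def query(self, question):
        if not self.use_agent or not hasattr(self, "agent"):
            return self._fallback_rag_search(question)
        return self.agent.invoke({"messages": [question]})

    def _fallback_rag_search(self, question):
        # Stand-in for the PR's retrieval-only fallback.
        return f"fallback answer for: {question}"
```

The key point is that the `use_agent` flag is actually consulted before `self.agent.invoke()` is called, so a failed `create_agent` can no longer cause an AttributeError at query time.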
@xujiongze I've opened a new pull request, #100, to work on those changes. Once the pull request is ready, I'll request review from you.

@xujiongze I've opened a new pull request, #101, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Updated!