Skip to content

Feature/python/rag agent#94

Closed
xujiongze wants to merge 32 commits intomainfrom
feature/python/rag-agent
Closed

Feature/python/rag agent#94
xujiongze wants to merge 32 commits intomainfrom
feature/python/rag-agent

Conversation

@xujiongze
Copy link
Copy Markdown
Collaborator

Updated!

xujiongze and others added 23 commits October 22, 2025 17:57
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings November 28, 2025 13:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a comprehensive RAG (Retrieval-Augmented Generation) agent for the IOC.EAssistant chatbot, transforming it from a simple echo service into a fully functional AI-powered assistant. The implementation supports dual model providers (OpenAI and Ollama), integrates document vectorization with ChromaDB, and provides a stateless REST API that interfaces with a .NET backend for conversation history management.

Key changes:

  • RAG Implementation: Added complete RAG pipeline with LangChain agent, tool calling, MMR-based retrieval, and web search fallback
  • Dual Provider Support: Flexible architecture supporting both OpenAI (cloud) and Ollama (local) model providers
  • Document Processing: Implemented crawler enhancements and vectorization pipeline with metadata extraction and ChromaDB storage

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 19 comments.

Show a summary per file
File Description
python/web.py Deleted placeholder Flask app with echo functionality
python/app.py New Flask app with RAG agent integration, conversation history support, and OpenAI-compatible API format
python/rag_agent.py Core RAG agent implementation with LangChain, dual provider support, MMR retrieval, and web search tools
python/vectorize_documents.py Document vectorization pipeline with metadata extraction, chunking, and ChromaDB persistence
python/utils.py Utility functions for GPU configuration and document formatting
python/crawler.py Enhanced to assign document types ('general' vs 'noticia') for filtered retrieval
python/requirements.txt Added LangChain ecosystem dependencies and updated existing packages
python/README.md Comprehensive documentation covering setup, providers, API usage, and architecture
python/.gitignore Added chroma_db/ to ignored directories
python/.env.example Added configuration for model providers, embeddings, LLM models, and database settings

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

python/app.py Outdated
Comment on lines +35 to +51
subprocess.run(
[sys.executable, "crawler.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600 # 10 minutes timeout
)

# If ChromaDB doesn't exist, run vectorize_documents
if not chroma_db_exists:
print("vectorizing documents...")
subprocess.run(
[sys.executable, "vectorize_documents.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600 # 10 minutes timeout
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subprocess call to crawler.py doesn't validate or sanitize environment variables that could affect its execution. If an attacker can control environment variables (e.g., through Docker), they could potentially inject malicious commands. Consider validating inputs or using more secure subprocess invocation methods.

Suggested change
subprocess.run(
[sys.executable, "crawler.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600 # 10 minutes timeout
)
# If ChromaDB doesn't exist, run vectorize_documents
if not chroma_db_exists:
print("vectorizing documents...")
subprocess.run(
[sys.executable, "vectorize_documents.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600 # 10 minutes timeout
# Sanitize environment for subprocess
safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
safe_env["PATH"] = "/usr/bin:/bin" # Set a safe default PATH
subprocess.run(
[sys.executable, "crawler.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600, # 10 minutes timeout
env=safe_env
)
# If ChromaDB doesn't exist, run vectorize_documents
if not chroma_db_exists:
print("vectorizing documents...")
# Sanitize environment for subprocess
safe_env = {k: v for k, v in os.environ.items() if k not in ["PYTHONPATH", "PATH"]}
safe_env["PATH"] = "/usr/bin:/bin" # Set a safe default PATH
subprocess.run(
[sys.executable, "vectorize_documents.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600, # 10 minutes timeout
env=safe_env

Copilot uses AI. Check for mistakes.
python/app.py Outdated
Comment on lines +35 to +41
subprocess.run(
[sys.executable, "crawler.py"],
cwd=os.path.dirname(os.path.abspath(__file__)),
capture_output=True,
text=True,
timeout=600 # 10 minutes timeout
)
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the crawler synchronously during application startup with a 10-minute timeout can cause deployment delays and potential startup timeouts. Consider running this as a separate background job or scheduled task instead of blocking application startup.

Copilot uses AI. Check for mistakes.
@xujiongze
Copy link
Copy Markdown
Collaborator Author

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Copy Markdown

Copilot AI commented Nov 28, 2025

@xujiongze I've opened a new pull request, #95, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings November 28, 2025 22:03
xujiongze and others added 3 commits November 28, 2025 23:04
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Nov 28, 2025

@xujiongze I've opened a new pull request, #96, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Nov 28, 2025

@xujiongze I've opened a new pull request, #97, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Copy Markdown

Copilot AI commented Nov 28, 2025

@xujiongze I've opened a new pull request, #98, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Nov 28, 2025

@xujiongze I've opened a new pull request, #99, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 24 comments.

Comments suppressed due to low confidence (1)

python/app.py:7

  • Import of 'asyncio' is not used.
from datetime import datetime

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

print(f" Date: {sample.metadata.get('date', 'N/A')}")
print(f" Content preview: {sample.page_content[:150]}...\n")

print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment on line 196 says "size={chunk_size}, overlap={chunk_overlap}" but the actual values used are hardcoded to 700 and 120 (lines 200-201), not the parameters. Update the comment to reflect the actual values or fix the code to use the parameters.

Suggested change
print(f"Splitting documents into chunks (size={chunk_size}, overlap={chunk_overlap})...")
print("Splitting documents into chunks (size=700, overlap=120)...")

Copilot uses AI. Check for mistakes.
Comment on lines +262 to +267
self.agent = create_agent(self.llm, self.tools, system_prompt=system_prompt)
self.use_agent = True
print("Using agent mode with tool calling")
except NotImplementedError:
print("Agent mode not supported, langchain version may be outdated.")
self.use_agent = False
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent initialization always sets self.use_agent = True on line 263, but this variable is never used anywhere in the code. The code then proceeds to call self.agent.invoke() without checking if the agent was successfully created. If create_agent raises NotImplementedError, the exception is caught and self.use_agent is set to False, but in the query and query_with_history methods, the code still attempts to invoke self.agent without checking this flag, which will cause an AttributeError.

Add a check in the query methods:

if not self.use_agent or not hasattr(self, 'agent'):
    # Use fallback directly
    return self._fallback_rag_search(question)

try:
    response = self.agent.invoke({"messages": messages})
    ...

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings November 29, 2025 09:52
xujiongze and others added 2 commits November 29, 2025 10:52
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Nov 29, 2025

@xujiongze I've opened a new pull request, #100, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Copy Markdown

Copilot AI commented Nov 29, 2025

@xujiongze I've opened a new pull request, #101, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@xujiongze xujiongze closed this Nov 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants