feat: Add PDF support to RAG knowledge base #4
Open
zhaog100 wants to merge 1 commit into versila22:main from
## Features

### ✅ PDF Support
- Add bot/pdf_parser.py - PDF text extraction using PyPDF2
- Extract text page by page for RAG indexing
- Automatic PDF file detection in knowledge/ folder
- Fallback handling if PyPDF2 not installed

### ✅ Enhanced RAG System
- Modified bot/rag.py to support both Markdown and PDF
- Improved logging for better debugging
- Better error handling

### ✅ Documentation Updates
- Updated README.md with PDF usage instructions
- Enhanced knowledge/README.md with PDF guidelines
- Added installation instructions for PyPDF2

## Changes
- bot/pdf_parser.py (new, 90 lines) - PDF extraction module
- bot/rag.py (modified) - Added PDF support
- requirements.txt (modified) - Added PyPDF2>=3.0.0
- README.md (modified) - Added PDF documentation
- knowledge/README.md (modified) - Added PDF usage guide

## Usage
Add PDF files to knowledge/ folder and restart the bot.

Enhances versila22#2
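
The automatic detection of Markdown and PDF files in knowledge/ could look roughly like the sketch below. It is not taken from bot/rag.py: the names `KNOWLEDGE_EXTENSIONS` and `find_knowledge_files` are hypothetical, and the real module may walk the folder differently.

```python
from pathlib import Path

# Hypothetical names for illustration; not taken from bot/rag.py.
KNOWLEDGE_EXTENSIONS = {".md", ".pdf"}


def find_knowledge_files(folder, extensions=KNOWLEDGE_EXTENSIONS):
    """Return the files in `folder` grouped by lowercase suffix, sorted by name."""
    found = {ext: [] for ext in sorted(extensions)}
    root = Path(folder)
    if not root.is_dir():
        return found  # a missing folder simply yields no documents
    for path in sorted(root.iterdir()):
        ext = path.suffix.lower()  # treat "manual.PDF" like "manual.pdf"
        if path.is_file() and ext in found:
            found[ext].append(path)
    return found
```

Grouping by suffix keeps the dispatch simple: Markdown paths can go to the existing loader and PDF paths to the new parser.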
Summary
Enhances the existing RAG system with PDF file support, allowing device manuals and documentation to be indexed alongside Markdown files.
Features
✅ PDF Support
- Automatic PDF file detection in the knowledge/ folder
✅ Enhanced RAG System
- Modified bot/rag.py to support both Markdown and PDF files
✅ Documentation
- Updated README.md with PDF usage instructions
- Enhanced knowledge/README.md with PDF guidelines
Changes
- bot/pdf_parser.py
- bot/rag.py
- requirements.txt
- README.md
- knowledge/README.md
Usage Example
The bot automatically:
- Detects PDF files in knowledge/
Test Results
Manual Testing
✅ PDF files detected in knowledge/ folder
✅ Text extraction successful (PyPDF2)
✅ Embeddings calculated correctly
✅ Search returns relevant results
✅ Fallback works when PyPDF2 not installed
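
A minimal sketch of the page-by-page extraction and the missing-dependency fallback checked above, assuming the PyPDF2 3.x API (`PdfReader`, `.pages`, `extract_text()`). The function name `extract_pdf_pages` and the `reader_factory` parameter are hypothetical and exist only to make the sketch testable without a real PDF. The actual bot/pdf_parser.py may be structured differently.

```python
def extract_pdf_pages(path, reader_factory=None):
    """Return a list of (page_number, text) pairs, or [] if PyPDF2 is missing.

    `reader_factory` is a hypothetical test hook; by default the function
    uses PyPDF2.PdfReader.
    """
    if reader_factory is None:
        try:
            from PyPDF2 import PdfReader  # optional dependency
        except ImportError:
            return []  # fallback: no PDF text, the caller keeps indexing Markdown
        reader_factory = PdfReader

    pages = []
    for number, page in enumerate(reader_factory(path).pages, start=1):
        text = (page.extract_text() or "").strip()
        if text:  # skip pages with no extractable text (e.g. scanned images)
            pages.append((number, text))
    return pages
```

Keeping the page number next to each text chunk would let search results point back to the page of the manual they came from.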
Performance
Installation
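PDF support is optional and backward compatible: install PyPDF2>=3.0.0 (added to requirements.txt) to enable it; the fallback handling covers the case where PyPDF2 is not installed.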
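
One way to keep the dependency optional is to probe for PyPDF2 at startup and only enable the `.pdf` extension when it is importable. This is a sketch under that assumption, not the PR's code, and the helper names `pdf_support_available` and `supported_extensions` are hypothetical.

```python
import importlib.util


def pdf_support_available():
    """True when the optional PyPDF2 package can be imported."""
    return importlib.util.find_spec("PyPDF2") is not None


def supported_extensions():
    """Markdown is always indexed; PDF only when PyPDF2 is present."""
    extensions = {".md"}
    if pdf_support_available():
        extensions.add(".pdf")
    return extensions
```

With this shape, an installation without PyPDF2 behaves as it did before the change, since only Markdown files are considered.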
Benefits
Example Use Case
Scenario: Parents have a new Samsung TV with a complex remote.
Before: Need to manually transcribe the manual into Markdown.
After:
- Add the PDF manual to the knowledge/ folder
Related
Enhances #2 (Local RAG for device manuals)
Checklist
Files Changed: 5 files (+170 lines, -11 lines)
Dependencies: PyPDF2>=3.0.0 (optional)
Tested on: Python 3.12, PyPDF2 3.0.1, Google Gemini API