A unified microservice for research paper discovery and intelligent document processing. It combines ArXiv paper search with RAG (Retrieval-Augmented Generation) capabilities powered by Chunkr AI.
- Search ArXiv Papers: Full-text search with sorting and pagination
- Trending Papers: Discover recent papers from the last month
- Recommended Papers: Curated foundational papers in CS/ML
- Chunkr AI Integration: Advanced document intelligence with layout analysis
- Intelligent Chunking: Layout-aware document segmentation
- Semantic Search: Search within paper content using processed chunks
- Annotation Support: Index and search through paper annotations
- Multi-format Support: PDFs, Word, Excel, PowerPoint, and images
- Clean Architecture: Model-Controller-Service pattern
- Structured Response: Consistent JSON format with metadata
- Configurable: Environment-based configuration
- Demo Ready: Perfect for hackathon demonstrations
DataEngineX/
├── app/
│   ├── models/                  # Pydantic models for request/response
│   │   ├── paper.py             # ArXiv paper models
│   │   └── rag_models.py        # RAG-specific models
│   ├── controllers/             # Business logic and request handling
│   │   ├── paper_controller.py  # ArXiv paper operations
│   │   └── rag_controller.py    # RAG operations
│   ├── services/                # External API integrations
│   │   ├── arxiv_service.py     # ArXiv API integration
│   │   └── chunkr_service.py    # Chunkr AI integration
│   └── utils/                   # Configuration and utilities
│       └── config.py            # Enhanced configuration
├── main.py                      # FastAPI application with all endpoints
├── requirements.txt             # Updated dependencies
└── README.md                    # This file
- Python 3.8+
- pip
- Chunkr AI API key (optional - demo mode available)
- Supabase account (for data storage)
- Clone the repository:
git clone https://github.com/your-username/DataEngineX.git
cd DataEngineX
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
# Create .env file
cp .env.example .env
# Edit .env with your credentials
CHUNKR_API_KEY=your_chunkr_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_anon_key
- Run the application:
python main.py
The API will be available at http://localhost:8000
Once running, visit:
- Interactive API docs: http://localhost:8000/docs
- ReDoc documentation: http://localhost:8000/redoc
GET /api/arxiv/search?query=machine+learning&max_results=10
GET /api/papers/recommended?limit=20
GET /api/papers/trending?limit=15
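Query values that contain spaces, such as a search for machine learning, have to be URL-encoded before they are sent. The following is a minimal sketch of building the three GET URLs above with the standard library; `build_url` and `BASE_URL` are illustrative names that are not part of the service, and the base address assumes the default local setup.

```python
from urllib.parse import urlencode

# Assumption: the service runs at the default local address.
BASE_URL = "http://localhost:8000"


def build_url(path: str, **params) -> str:
    """Return BASE_URL + path with URL-encoded query parameters."""
    query = urlencode(params)  # encodes spaces as '+'
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


search_url = build_url("/api/arxiv/search", query="machine learning", max_results=10)
recommended_url = build_url("/api/papers/recommended", limit=20)
trending_url = build_url("/api/papers/trending", limit=15)
```

The resulting strings can be passed to any HTTP client, or pasted into a browser while the service is running.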
POST /api/rag/papers/{paper_id}/index
{
"paper_id": "2301.00001",
"title": "Paper Title",
"authors": ["Author One"],
"pdf_url": "https://arxiv.org/pdf/2301.00001.pdf"
}
POST /api/rag/papers/{paper_id}/search
{
"query": "neural networks",
"limit": 5
}
POST /api/rag/papers/{paper_id}/annotations/{annotation_id}
{
"annotation_id": "note_1",
"content": "Important finding about attention mechanisms",
"user_id": "researcher_123"
}
GET /api/rag/stats
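The POST endpoints above take a JSON body. As a rough sketch, the index and search requests could be prepared with the standard library as shown here. The helper `json_post` is a hypothetical name, the base address assumes the default local setup, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

# Assumption: the service runs at the default local address.
BASE_URL = "http://localhost:8000"


def json_post(path: str, body: dict) -> Request:
    """Build (but do not send) a POST request carrying a JSON body."""
    return Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


paper_id = "2301.00001"

# Body fields copied from the index example above.
index_req = json_post(
    f"/api/rag/papers/{paper_id}/index",
    {
        "paper_id": paper_id,
        "title": "Paper Title",
        "authors": ["Author One"],
        "pdf_url": "https://arxiv.org/pdf/2301.00001.pdf",
    },
)

# Body fields copied from the search example above.
search_req = json_post(
    f"/api/rag/papers/{paper_id}/search",
    {"query": "neural networks", "limit": 5},
)
```

With the service running, either request can be sent with `urllib.request.urlopen(index_req)`; the shape of the response is not specified here, so inspect it in the interactive API docs.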
POST /api/demo/complete-workflow?query=transformers&paper_index=0&search_query=attention
This single endpoint demonstrates the entire workflow:
- Search ArXiv for papers matching "transformers"
- Index the first paper using Chunkr AI
- Search within the paper for "attention"
- Return complete results showing the full pipeline
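The four steps above can be sketched as a single function that chains three operations. This is an illustration of the control flow only, not the service's actual implementation: the function name, the injected callables, and the keys of the returned dictionary are all assumptions, and the stubs at the bottom stand in for the real ArXiv and Chunkr calls.

```python
from typing import Any, Callable, Dict, List


def complete_workflow(
    query: str,
    paper_index: int,
    search_query: str,
    search_arxiv: Callable[[str], List[Dict[str, Any]]],
    index_paper: Callable[[Dict[str, Any]], int],
    search_paper: Callable[[str, str], List[str]],
) -> Dict[str, Any]:
    """Chain search -> index -> in-paper search and collect the results."""
    # Step 1: search ArXiv for papers matching the query.
    papers = search_arxiv(query)
    if not 0 <= paper_index < len(papers):
        raise IndexError(
            f"paper_index {paper_index} out of range for {len(papers)} results"
        )
    paper = papers[paper_index]

    # Step 2: index the selected paper (hypothetically returns a chunk count).
    chunks_indexed = index_paper(paper)

    # Step 3: search within the indexed paper.
    matches = search_paper(paper["paper_id"], search_query)

    # Step 4: return everything in one response (hypothetical shape).
    return {
        "query": query,
        "selected_paper": paper,
        "chunks_indexed": chunks_indexed,
        "search_query": search_query,
        "matches": matches,
    }


# Usage with stub callables in place of the real services.
result = complete_workflow(
    "transformers",
    0,
    "attention",
    search_arxiv=lambda q: [{"paper_id": "2301.00001", "title": "Paper Title"}],
    index_paper=lambda paper: 3,
    search_paper=lambda pid, q: [f"chunk mentioning {q}"],
)
```

Passing the three operations in as callables keeps the sketch runnable without network access and makes each step easy to replace in tests.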
The application uses environment variables for configuration. Create a .env file:
# Chunkr AI Configuration
CHUNKR_API_KEY=your_chunkr_api_key
# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
# Optional: Custom settings
DEBUG=True
- Advanced Document Processing: Layout-aware chunking instead of simple text splitting
- Multi-format Support: Process PDFs, Word docs, presentations, and images
- OCR Capabilities: Extract text from scanned documents
- Vision Language Models: Enhanced document understanding
- Single Service: Both paper discovery and RAG in one service
- Reduced Complexity: No more managing multiple microservices
- Better Performance: Eliminated network overhead between services
- Shared Dependencies: Unified FastAPI and Supabase integration
- Complete Workflow Endpoint: Perfect for live demonstrations
- Fallback Demo Mode: Works even without Chunkr API key
- System Statistics: Real-time stats for dashboards
- Enhanced Documentation: Better API docs and examples
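The environment-based configuration and the fallback demo mode could be combined roughly as follows. This is a sketch under stated assumptions, not the contents of `app/utils/config.py`: the `Settings` class, `load_settings`, and the rule that a missing `CHUNKR_API_KEY` switches on demo mode are all hypothetical. Only the variable names come from the `.env` example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    chunkr_api_key: Optional[str]
    supabase_url: Optional[str]
    supabase_key: Optional[str]
    debug: bool

    @property
    def demo_mode(self) -> bool:
        # Assumption: demo mode is active whenever no Chunkr key is set.
        return not self.chunkr_api_key


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping of environment variables."""
    return Settings(
        # Empty strings are treated the same as unset variables.
        chunkr_api_key=env.get("CHUNKR_API_KEY") or None,
        supabase_url=env.get("SUPABASE_URL") or None,
        supabase_key=env.get("SUPABASE_KEY") or None,
        debug=env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
    )


# Usage: no Chunkr key supplied, so the sketch falls back to demo mode.
settings = load_settings({"DEBUG": "True"})
```

Accepting the environment as a parameter makes the loader easy to test; calling `load_settings()` with no argument reads the real process environment.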
- New Models: Add to app/models/rag_models.py
- New Services: Extend app/services/chunkr_service.py
- New Endpoints: Add controller methods and route them in main.py
-- Paper chunks table
CREATE TABLE paper_chunks (
id SERIAL PRIMARY KEY,
paper_id TEXT NOT NULL,
chunk_id TEXT UNIQUE NOT NULL,
content TEXT NOT NULL,
page_number INTEGER,
section TEXT,
chunk_index INTEGER,
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Annotations table
CREATE TABLE annotations (
id TEXT PRIMARY KEY,
paper_id TEXT NOT NULL,
content TEXT NOT NULL,
highlight_text TEXT,
user_id TEXT,
page_number INTEGER,
created_at TIMESTAMP DEFAULT NOW()
);
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This unified service provides everything you need for an impressive research paper processing demo:
- Quick Setup: Get running in minutes
- Demo Mode: Works without external API keys
- Complete Workflow: End-to-end paper discovery and processing
- Modern Tech Stack: FastAPI, Chunkr AI, Supabase
- Great Documentation: Swagger UI for easy testing
- Integration with more academic databases (PubMed, IEEE, etc.)
- Advanced search filters and faceted search
- Citation analysis and paper metrics
- AI-powered paper recommendations based on content
- Analytics dashboard for usage metrics
- Cross-paper semantic search capabilities