Skip to content

Service gathers data and research paper metadata from multiple APIs, returning structured results for easy integration.

License

Notifications You must be signed in to change notification settings

MetaHackathon/DataEngineX

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

22 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

DataEngineX

A unified microservice for research paper discovery and intelligent document processing. Combines ArXiv paper search with advanced RAG (Retrieval Augmented Generation) capabilities powered by Chunkr AI.

πŸš€ Features

Paper Discovery

  • πŸ” Search ArXiv Papers: Full-text search with sorting and pagination
  • πŸ“ˆ Trending Papers: Discover recent papers from the last month
  • ⭐ Recommended Papers: Curated foundational papers in CS/ML

RAG Processing (NEW!)

  • πŸ€– Chunkr AI Integration: Advanced document intelligence with layout analysis
  • 🧠 Intelligent Chunking: Layout-aware document segmentation
  • πŸ” Semantic Search: Search within paper content using processed chunks
  • πŸ“ Annotation Support: Index and search through paper annotations
  • πŸ“Š Multi-format Support: PDFs, Word, Excel, PowerPoint, and images

Architecture

  • πŸ—οΈ Clean Architecture: Model-Controller-Service pattern
  • πŸ“Š Structured Response: Consistent JSON format with metadata
  • πŸ”§ Configurable: Environment-based configuration
  • 🎯 Demo Ready: Perfect for hackathon demonstrations

Architecture

DataEngineX/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ models/              # Pydantic models for request/response
β”‚   β”‚   β”œβ”€β”€ paper.py         # ArXiv paper models
β”‚   β”‚   └── rag_models.py    # RAG-specific models
β”‚   β”œβ”€β”€ controllers/         # Business logic and request handling
β”‚   β”‚   β”œβ”€β”€ paper_controller.py  # ArXiv paper operations
β”‚   β”‚   └── rag_controller.py    # RAG operations
β”‚   β”œβ”€β”€ services/            # External API integrations
β”‚   β”‚   β”œβ”€β”€ arxiv_service.py     # ArXiv API integration
β”‚   β”‚   └── chunkr_service.py    # Chunkr AI integration
β”‚   └── utils/              # Configuration and utilities
β”‚       └── config.py        # Enhanced configuration
β”œβ”€β”€ main.py                 # FastAPI application with all endpoints
β”œβ”€β”€ requirements.txt        # Updated dependencies
└── README.md              # This file

Quick Start

Prerequisites

  • Python 3.8+
  • pip
  • Chunkr AI API key (optional - demo mode available)
  • Supabase account (for data storage)

Installation

  1. Clone the repository:
git clone https://github.com/your-username/DataEngineX.git
cd DataEngineX
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up environment variables:
# Create .env file
cp .env.example .env

# Edit .env with your credentials
CHUNKR_API_KEY=your_chunkr_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_anon_key
  1. Run the application:
python main.py

The API will be available at http://localhost:8000

API Documentation

Once running, visit:

  • Interactive API docs: http://localhost:8000/docs
  • ReDoc documentation: http://localhost:8000/redoc

API Endpoints

πŸ“š Paper Discovery (Existing)

Search ArXiv Papers

GET /api/arxiv/search?query=machine learning&max_results=10

Get Recommended Papers

GET /api/papers/recommended?limit=20

Get Trending Papers

GET /api/papers/trending?limit=15

πŸ€– RAG Processing (New!)

Index Paper for RAG

POST /api/rag/papers/{paper_id}/index
{
  "paper_id": "2301.00001",
  "title": "Paper Title",
  "authors": ["Author One"],
  "pdf_url": "https://arxiv.org/pdf/2301.00001.pdf"
}

Search Within Paper

POST /api/rag/papers/{paper_id}/search
{
  "query": "neural networks",
  "limit": 5
}

Index Annotation

POST /api/rag/papers/{paper_id}/annotations/{annotation_id}
{
  "annotation_id": "note_1",
  "content": "Important finding about attention mechanisms",
  "user_id": "researcher_123"
}

Get System Stats

GET /api/rag/stats

🎯 Demo Workflow (Perfect for Hackathons!)

Complete End-to-End Demo

POST /api/demo/complete-workflow?query=transformers&paper_index=0&search_query=attention

This single endpoint demonstrates the entire workflow:

  1. Search ArXiv for papers matching "transformers"
  2. Index the first paper using Chunkr AI
  3. Search within the paper for "attention"
  4. Return complete results showing the full pipeline

Configuration

The application uses environment variables for configuration. Create a .env file:

# Chunkr AI Configuration
CHUNKR_API_KEY=your_chunkr_api_key

# Supabase Configuration  
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key

# Optional: Custom settings
DEBUG=True

What's New in v2.0

πŸ†• Chunkr AI Integration

  • Advanced Document Processing: Layout-aware chunking instead of simple text splitting
  • Multi-format Support: Process PDFs, Word docs, presentations, and images
  • OCR Capabilities: Extract text from scanned documents
  • Vision Language Models: Enhanced document understanding

πŸ†• Unified Service Architecture

  • Single Service: Both paper discovery and RAG in one service
  • Reduced Complexity: No more managing multiple microservices
  • Better Performance: Eliminated network overhead between services
  • Shared Dependencies: Unified FastAPI and Supabase integration

πŸ†• Demo-Ready Features

  • Complete Workflow Endpoint: Perfect for live demonstrations
  • Fallback Demo Mode: Works even without Chunkr API key
  • System Statistics: Real-time stats for dashboards
  • Enhanced Documentation: Better API docs and examples

Development

Adding New Features

  1. New Models: Add to app/models/rag_models.py
  2. New Services: Extend app/services/chunkr_service.py
  3. New Endpoints: Add controller methods and route them in main.py

Database Schema (Supabase)

-- Paper chunks table
CREATE TABLE paper_chunks (
  id SERIAL PRIMARY KEY,
  paper_id TEXT NOT NULL,
  chunk_id TEXT UNIQUE NOT NULL,
  content TEXT NOT NULL,
  page_number INTEGER,
  section TEXT,
  chunk_index INTEGER,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Annotations table
CREATE TABLE annotations (
  id TEXT PRIMARY KEY,
  paper_id TEXT NOT NULL,
  content TEXT NOT NULL,
  highlight_text TEXT,
  user_id TEXT,
  page_number INTEGER,
  created_at TIMESTAMP DEFAULT NOW()
);

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸŽ‰ Perfect for Hackathons!

This unified service provides everything you need for an impressive research paper processing demo:

  • Quick Setup: Get running in minutes
  • Demo Mode: Works without external API keys
  • Complete Workflow: End-to-end paper discovery and processing
  • Modern Tech Stack: FastAPI, Chunkr AI, Supabase
  • Great Documentation: Swagger UI for easy testing

Future Enhancements

  • 🌍 Integration with more academic databases (PubMed, IEEE, etc.)
  • 🎯 Advanced search filters and faceted search
  • πŸ“ˆ Citation analysis and paper metrics
  • πŸ€– AI-powered paper recommendations based on content
  • πŸ“Š Analytics dashboard for usage metrics
  • πŸ” Cross-paper semantic search capabilities

About

Service gathers data and research paper metadata from multiple APIs, returning structured results for easy integration.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •