Comprehensive documentation of all technologies, libraries, and frameworks used in the Multi-Modal Academic Research System.
- Overview
- Core Technologies
- AI and Machine Learning
- Search and Vector Database
- Data Collection and Processing
- Web Framework and UI
- Supporting Libraries
- Development Tools
- Version Compatibility
- Architecture Decisions
- Alternative Technologies Considered
The Multi-Modal Academic Research System is built on a modern Python stack, leveraging state-of-the-art AI models, vector search capabilities, and a user-friendly web interface. The technology choices prioritize:
- Free and Open Source: Minimize costs, maximize accessibility
- Local-First: Data and processing remain on your machine
- Modern AI: Latest language models and embedding techniques
- Scalability: Capable of handling thousands of documents
- Ease of Use: Simple installation and intuitive interface
Version: 3.9+ (3.11 recommended)
Why Python:
- Rich ecosystem for AI/ML and data processing
- Extensive libraries for academic data collection
- Strong community support
- Easy integration with modern AI APIs
- Excellent for rapid prototyping and development
Why Python 3.9+:
- Type hints support (PEP 585, 604)
- Dictionary merge operators
- String methods improvements
- Required for modern dependency versions
Official: https://www.python.org/
Purpose: AI-powered text generation and vision analysis
Why Gemini:
- Free tier available (no credit card required)
- State-of-the-art multimodal capabilities
- Excellent for academic content analysis
- Vision API for diagram understanding
- Competitive with GPT-4 in many tasks
- Lower latency than other commercial APIs
Models Used:
- Gemini 1.5 Pro (`gemini-1.5-pro-latest`):
  - Research query responses
  - Text synthesis and analysis
  - Citation generation
  - Context window: 1M tokens
- Gemini 1.5 Flash (`gemini-1.5-flash`):
  - PDF diagram analysis
  - Image description
  - Faster, lower cost for vision tasks
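Example (a minimal sketch using the `google-generativeai` SDK; the prompt, and reading the key from the environment, are illustrative):
```python
import os
import google.generativeai as genai

# Assumes GEMINI_API_KEY is set in the environment.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Summarize this abstract in two sentences: ...")
print(response.text)
```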
Alternatives Considered:
- OpenAI GPT-4: Higher cost, requires payment
- Anthropic Claude: Limited free tier
- Local models (LLaMA): Higher compute requirements
Official: https://ai.google.dev/
Purpose: AI orchestration and prompt management framework
Why LangChain:
- Industry-standard for LLM applications
- Built-in memory management for conversations
- Excellent prompt templating
- Easy integration with multiple LLM providers
- Strong community and documentation
- Modular architecture for extensibility
Key Features Used:
- `ChatGoogleGenerativeAI`: Gemini integration
- `ConversationBufferWindowMemory`: Conversation tracking
- `PromptTemplate`: Structured prompts for research queries
- Chain composition for multi-step reasoning
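A minimal sketch of these pieces together (the prompt text and parameters are illustrative, not the system's exact chain):
```python
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

# Assumes GOOGLE_API_KEY is set in the environment.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest", temperature=0.2)

prompt = PromptTemplate.from_template(
    "Answer the research question using only the context.\n"
    "Context: {context}\nQuestion: {question}"
)

chain = prompt | llm  # runnable composition
result = chain.invoke({"context": "...", "question": "What is hybrid search?"})
print(result.content)
```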
LangChain Google GenAI (1.0.10):
- Official LangChain integration for Google's AI
- Seamless Gemini model support
- Streaming responses
- Function calling support
Alternatives Considered:
- Direct API calls: More work, less abstraction
- LlamaIndex: Different focus (more on indexing)
- Haystack: Less mature Gemini integration
Official: https://python.langchain.com/
Purpose: Generate semantic embeddings for vector search
Why Sentence Transformers:
- State-of-the-art semantic similarity models
- Fast inference on CPU
- Excellent for retrieval tasks
- Easy to use and well-documented
- Large collection of pre-trained models
- Active development and updates
Model Used: all-MiniLM-L6-v2
- Dimensions: 384
- Speed: ~14,000 sentences/second (V100 GPU benchmark; slower but still fast on CPU)
- Performance: Excellent quality/speed tradeoff
- Size: ~90MB download
- Best for: Semantic search in medium-large corpora
Why all-MiniLM-L6-v2:
- Optimal balance of speed and quality
- Small embedding size (384d) = lower storage
- Fast retrieval with k-NN search
- Good performance on academic text
- Widely used and battle-tested
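Generating embeddings takes a few lines (the sentences here are illustrative):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "Transformers use self-attention to model long-range dependencies.",
    "Gradient descent iteratively minimizes a loss function.",
])
print(embeddings.shape)  # (2, 384)
```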
Alternative Models Available:
- `all-mpnet-base-v2`: Higher quality (768d), slower
- `all-MiniLM-L12-v2`: Similar size, slightly better quality
- `multi-qa-mpnet-base-dot-v1`: Optimized for Q&A tasks
Official: https://www.sbert.net/
Purpose: Vector database, full-text search, and document indexing
Why OpenSearch:
- Open source: Fork of Elasticsearch, Apache 2.0 license
- Free to use: No licensing costs
- Hybrid search: Combines keyword (BM25) and vector (k-NN) search
- Scalable: Handles millions of documents
- Feature-rich: Aggregations, filtering, faceting
- Local deployment: Runs on your machine, full control
- Active development: AWS-backed, regular updates
Key Features Used:
- k-NN Vector Search:
  - 384-dimensional embeddings
  - Cosine similarity for semantic matching
  - Fast approximate nearest neighbors
- Full-Text Search:
  - BM25 ranking algorithm
  - Multi-field queries
  - Field boosting (title^2)
- Hybrid Search:
  - Combines keyword and vector results
  - Best of both worlds: precise and semantic
  - Configurable weighting
- Index Management:
  - Custom mappings for multi-modal content
  - Nested objects for citations
  - Keyword arrays for metadata
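A minimal k-NN query sketch with `opensearch-py` (field names follow the schema below; the index name and zero vector are placeholders, and a local dev instance without the security plugin is assumed):
```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_vector = [0.0] * 384  # placeholder for a real 384-dim embedding
body = {
    "size": 5,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
}
results = client.search(index="research", body=body)  # 'research' is illustrative
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```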
Schema Design:
```python
{
    'content_type': 'keyword',         # paper/video/podcast
    'title': 'text',                   # Searchable title
    'abstract': 'text',                # Paper abstract
    'content': 'text',                 # Main content
    'authors': 'keyword',              # Author list
    'publication_date': 'date',        # Temporal filtering
    'embedding': 'knn_vector',         # 384-dim semantic vector
    'diagram_descriptions': 'text',    # From Gemini Vision
    'key_concepts': 'keyword',         # Extracted concepts
    'citations': 'nested'              # Citation tracking
}
```
Deployment: Docker container for local development
```bash
docker run -p 9200:9200 -e "discovery.type=single-node" \
  opensearchproject/opensearch:latest
```
Alternatives Considered:
- Elasticsearch: Proprietary license, OpenSearch is the open fork
- Pinecone: Cloud-only, costs money, less control
- Weaviate: Good but more complex setup
- Qdrant: Excellent but newer, smaller ecosystem
- ChromaDB: Included but not primary (simpler use cases)
- FAISS: No full-text search, just vectors
- Milvus: More complex, higher overhead
Why Not ChromaDB:
- Included in dependencies (`chromadb==0.4.20`)
- Could be used for simpler deployments
- OpenSearch chosen for hybrid search capabilities
- ChromaDB better for pure vector search
Official: https://opensearch.org/
Purpose: Collect papers from ArXiv preprint server
Why ArXiv:
- Largest open-access preprint repository
- Covers CS, physics, math, and more
- Free API, no authentication required
- Rich metadata (authors, abstract, categories)
- Direct PDF download links
Usage:
```python
import arxiv

search = arxiv.Search(query="machine learning", max_results=10)
results = list(arxiv.Client().results(search))
```
Official: https://arxiv.org/
Purpose: Collect papers from Google Scholar and Semantic Scholar
Why Scholarly:
- Access to Google Scholar's vast database
- Semantic Scholar integration
- Citation information
- No API key required (uses scraping)
- Author profiles and metrics
Limitations:
- Rate limiting required (avoid IP bans)
- May need proxies for heavy use
- Less reliable than official APIs
Usage:
```python
from scholarly import scholarly

search = scholarly.search_pubs('deep learning')
```
Official: https://github.com/scholarly-python-package/scholarly
Purpose: Extract transcripts from educational YouTube videos
Why YouTube Transcript API:
- Free and no API key required
- Accesses official YouTube transcripts
- Supports multiple languages
- Auto-generated and manual captions
- Timestamp information
Usage:
```python
from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript(video_id)
```
Alternatives:
- YouTube Data API v3: Requires API key, quota limits
- pytube (0.15.0): Video metadata extraction
- yt-dlp (2024.3.10): Alternative downloader
Official: https://github.com/jdepoix/youtube-transcript-api
Purpose: Parse RSS/Atom feeds for podcast collection
Why Feedparser:
- Universal feed parser (RSS, Atom, RDF)
- Robust and battle-tested
- Handles malformed feeds gracefully
- Extract episode metadata
- Widely used and maintained
Usage:
```python
import feedparser

feed = feedparser.parse('https://podcast.rss.url')
```
Official: https://feedparser.readthedocs.io/
Purpose: Extract text and metadata from PDF files
Why PyPDF:
- Pure Python, no external dependencies
- Actively maintained (formerly PyPDF2)
- Good text extraction
- PDF metadata parsing
- Page-by-page processing
Limitations:
- May struggle with complex layouts
- Scanned PDFs need OCR
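A minimal extraction sketch (the file name is illustrative):
```python
from pypdf import PdfReader

reader = PdfReader("paper.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"{len(reader.pages)} pages, {len(text)} characters")
```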
Official: https://pypdf.readthedocs.io/
Purpose: Advanced PDF processing and image extraction
Why PyMuPDF (fitz):
- Fast C-based implementation
- Excellent image extraction
- Better handling of complex PDFs
- Page rendering capabilities
- Extract diagrams and figures
Why Both PyPDF and PyMuPDF:
- Fallback mechanism: Try PyMuPDF first, fallback to PyPDF
- Different PDFs work better with different libraries
- Maximize successful text extraction
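A hedged sketch of that fallback idea (illustrative, not the system's exact code):
```python
def extract_text(path: str) -> str:
    """Try PyMuPDF first; fall back to pypdf if it fails."""
    try:
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    except Exception:
        from pypdf import PdfReader
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
```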
Usage:
```python
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
page_num = 0  # first page
images = doc[page_num].get_images()
```
Official: https://pymupdf.readthedocs.io/
Purpose: Audio transcription for videos and podcasts
Why Whisper:
- State-of-the-art speech recognition
- Multilingual support (99 languages)
- Open source from OpenAI
- Works offline (local processing)
- High accuracy on academic content
Models:
- `tiny`, `base`: Fast, lower accuracy
- `small`, `medium`: Balanced
- `large`: Best accuracy, slower
Note: Currently included but not fully integrated. YouTube uses existing transcripts.
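If you do wire it in, transcription is a few lines (model size and file name are illustrative; ffmpeg must be installed):
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("episode.mp3")
print(result["text"])
```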
Official: https://github.com/openai/whisper
Purpose: Web-based user interface
Why Gradio:
- Rapid development: Build UIs in minutes
- Beautiful defaults: Professional look out of the box
- Multi-tab support: Organize complex applications
- Real-time updates: Live feedback during processing
- Share functionality: Create public URLs instantly
- Python-native: No JavaScript required
- FastAPI integration: Built on modern async framework
Key Features Used:
- Tabbed interface (Research, Data Collection, Citations, Settings)
- Text input/output components
- Buttons and interactive elements
- Real-time status updates
- File upload capabilities
- Markdown rendering for responses
UI Structure:
```python
import gradio as gr

with gr.Blocks() as app:
    with gr.Tab("Research"):
        ...  # Query interface
    with gr.Tab("Data Collection"):
        ...  # Collection controls
    with gr.Tab("Citation Manager"):
        ...  # Citation export
    with gr.Tab("Settings"):
        ...  # System configuration

app.launch()
```
Why Gradio over alternatives:
- Streamlit: Less flexible layout, page reloads
- Dash: More complex, steeper learning curve
- Flask/FastAPI: Would require frontend development
- Jupyter: Not ideal for production deployment
FastAPI (0.109.0) + Uvicorn (0.27.0):
- Gradio runs on FastAPI (modern async framework)
- Uvicorn is the ASGI server
- High performance, async support
- Easy to extend with custom API endpoints
Official: https://gradio.app/
Purpose: Environment variable management
Why python-dotenv:
- Load `.env` files automatically
- Keep secrets out of code
- Easy configuration management
- Development/production separation
Usage:
```python
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('GEMINI_API_KEY')
```
Purpose: HTTP client for API calls
Why Requests:
- De facto standard for HTTP in Python
- Simple, elegant API
- Robust error handling
- Session management
- Wide adoption and support
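A typical call in this system's context (the arXiv export endpoint is public; the query parameters are illustrative):
```python
import requests

resp = requests.get(
    "https://export.arxiv.org/api/query",
    params={"search_query": "all:transformers", "max_results": 5},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, len(resp.text))
```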
Purpose: HTML and XML parsing
Why BeautifulSoup:
- Intuitive API for parsing HTML
- Robust handling of malformed HTML
- Multiple parser backends
- Used for web scraping when APIs unavailable
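A minimal parsing sketch (the HTML is illustrative):
```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Paper Title</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())  # "Paper Title"
```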
Purpose: Data manipulation and analysis
Why Pandas:
- DataFrame structure for tabular data
- Easy data transformation
- CSV export capabilities
- Integration with other tools
- Citation export and management
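Citation export follows the usual DataFrame-plus-`to_csv` pattern (the columns are illustrative):
```python
import pandas as pd

citations = pd.DataFrame([
    {"title": "Attention Is All You Need", "authors": "Vaswani et al.", "year": 2017},
])
citations.to_csv("citations.csv", index=False)
```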
Purpose: Numerical computing foundation
Why NumPy:
- Dependency for many libraries (Pandas, Sentence Transformers)
- Array operations for embeddings
- Mathematical computations
- Version <2.0 for compatibility
Purpose: Progress bars for long operations
Why tqdm:
- Visual feedback during collection/processing
- Works in terminal and notebooks
- Minimal overhead
- Easy integration
Purpose: Audio processing and manipulation
Why Pydub:
- Simple API for audio operations
- Format conversion
- Audio segmentation
- Required for Whisper integration
Purpose: YouTube video metadata extraction
Why yt-dlp:
- Active fork of youtube-dl
- Regular updates for YouTube changes
- Robust error handling
- Alternative to pytube
- More reliable than pytube
- Git: Source control and collaboration
- .gitignore: Excludes venv, .env, data, logs
- venv: Isolate project dependencies
- Platform-independent
- Built into Python 3.3+
- pip: Python package installer
- requirements.txt: Dependency specification
Minimum: Python 3.9
Recommended: Python 3.11
Why 3.9+:
- Type hint improvements (PEP 585, 604)
- Dictionary merge operators (|)
- String methods (removeprefix, removesuffix)
- Required for modern packages
Why 3.11 recommended:
- 10-60% faster than 3.10
- Better error messages
- Improved typing features
- Current stable release
Must be compatible:
- `opensearch-py` with OpenSearch server version
- `langchain` with `langchain-google-genai`
- `numpy <2.0` (compatibility constraint)
- `sentence-transformers` with PyTorch (auto-installed)
NumPy 2.0:
- Many libraries not yet compatible
- Pinned to <2.0 in requirements
- Will update when ecosystem catches up
LangChain:
- Rapid development, frequent updates
- Pin to 0.2.x for stability
- Check migration guides for major versions
OpenSearch:
- Compatible with 1.x and 2.x servers
- Client version should match server major version
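In `requirements.txt` terms, the pins named in this document look roughly like this (an illustrative excerpt, not the full file):
```text
numpy<2.0
langchain>=0.2,<0.3
langchain-google-genai==1.0.10
chromadb==0.4.20
fastapi==0.109.0
uvicorn==0.27.0
```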
Decision: Combine BM25 keyword search with k-NN vector search
Rationale:
- Keyword search (BM25): Precise matching, handles specific terms
- Vector search (k-NN): Semantic similarity, understands meaning
- Together: Best of both worlds
Example:
- Query: "neural network training"
- BM25 finds: Papers with exact phrase
- k-NN finds: Papers about "model optimization", "gradient descent"
- Hybrid: Comprehensive results covering all aspects
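A hedged sketch of one way to express such a hybrid query in OpenSearch (field and index names follow the schema shown earlier; `client` and `query_vector` are as in the earlier k-NN sketch, and the boost weighting is illustrative):
```python
# BM25 match and k-NN clauses combined in one bool query; OpenSearch
# sums the clause scores, so the boost controls the keyword/vector balance.
body = {
    "size": 10,
    "query": {
        "bool": {
            "should": [
                {"match": {"content": {"query": "neural network training", "boost": 1.0}}},
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
            ]
        }
    },
}
results = client.search(index="research", body=body)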
Decision: Run everything locally (OpenSearch, embeddings, processing)
Rationale:
- Privacy: Your data stays on your machine
- Cost: No cloud fees, only API costs for Gemini
- Control: Full control over infrastructure
- Offline: Works without constant internet
- Speed: No network latency for search
Cloud usage limited to:
- Gemini API calls (can be minimized)
- Public Gradio link (optional)
Decision: Use Google Gemini as primary LLM
Rationale:
- Free tier: No credit card, generous limits
- Multimodal: Native vision support
- Quality: Competitive with GPT-4
- Context: 1M token window
- Speed: Fast response times
- API: Simple, well-documented
Trade-offs:
- GPT-4 may be slightly better for some tasks
- OpenAI ecosystem is more mature
- But cost and accessibility favor Gemini
Decision: Use Gradio for UI instead of React/Vue
Rationale:
- Speed: Build in hours, not days
- Python-only: No JavaScript needed
- Good enough: Professional appearance
- Share feature: Instant public URLs
- Maintenance: Less code to maintain
Trade-offs:
- Less customization than custom frontend
- Tied to Gradio's update cycle
- Limited styling options
- But: Faster development and deployment
Evaluated:
- Pinecone: Cloud-only, paid service
- Weaviate: More setup complexity
- Qdrant: Excellent but newer
- Milvus: Higher operational overhead
- FAISS: No full-text search
- ChromaDB: Simpler but less features
Chosen: OpenSearch (hybrid search, open source, free)
Evaluated:
- OpenAI GPT-4: Better quality, higher cost, requires payment
- Anthropic Claude: Good quality, limited free tier
- Cohere: Commercial focus, pricing model
- Local models (LLaMA, Mistral): No API costs, high compute requirements
Chosen: Google Gemini (free, multimodal, good quality)
Evaluated:
- Direct API calls: More control, more work
- LlamaIndex: Better for pure indexing
- Haystack: Less mature Gemini support
Chosen: LangChain (mature, flexible, good Gemini integration)
Evaluated:
- Streamlit: Popular but less flexible
- Dash: Plotly-based, more complex
- Flask + React: Full control, more development
- Jupyter: Not production-ready
Chosen: Gradio (fast development, good features)
Enhanced Search:
- Reranking models (e.g., cross-encoders)
- Query expansion techniques
- Multi-vector search
More Data Sources:
- PubMed Central integration
- Springer/Nature APIs (if available)
- Podcast platforms (Spotify, Apple)
Advanced AI:
- Local LLM support (LLaMA, Mistral)
- Fine-tuned models for academic text
- Multi-agent systems
Performance:
- GPU acceleration for embeddings
- Distributed OpenSearch
- Caching layer (Redis)
Features:
- Document summarization
- Automatic concept extraction
- Knowledge graph generation
- Collaborative features
```
Multi-Modal Academic Research System
│
├── AI & ML
│ ├── google-generativeai (Gemini API)
│ ├── google-genai (newer SDK)
│ ├── langchain (orchestration)
│ ├── langchain-google-genai (integration)
│ └── sentence-transformers (embeddings)
│ └── PyTorch (auto-installed)
│
├── Search & Database
│ ├── opensearch-py (vector DB)
│ └── chromadb (alternative)
│
├── Data Collection
│ ├── arxiv (papers)
│ ├── scholarly (Google Scholar)
│ ├── youtube-transcript-api (YouTube)
│ ├── pytube (video metadata)
│ ├── yt-dlp (alternative downloader)
│ └── feedparser (podcasts)
│
├── Document Processing
│ ├── pypdf (PDF text)
│ ├── PyMuPDF (PDF images)
│ ├── whisper (audio transcription)
│ └── pydub (audio processing)
│
├── Web & UI
│ ├── gradio (UI framework)
│ ├── fastapi (backend)
│ └── uvicorn (ASGI server)
│
└── Supporting
├── python-dotenv (config)
├── requests (HTTP)
├── beautifulsoup4 (parsing)
├── pandas (data manipulation)
├── numpy (numerical computing)
    └── tqdm (progress bars)
```
Python Packages: ~2GB total
- PyTorch (sentence-transformers dependency): ~800MB
- Gradio + FastAPI: ~200MB
- OpenSearch client: ~50MB
- Other packages: ~950MB combined
Models (downloaded on first use):
- all-MiniLM-L6-v2 embeddings: ~90MB
- Whisper models (optional):
- tiny: ~75MB
- base: ~150MB
- small: ~500MB
- medium: ~1.5GB
- large: ~3GB
Docker Images:
- OpenSearch: ~1.2GB
Total First Install: ~3-4GB (without Whisper models)
On typical broadband (50 Mbps):
- Python packages: 5-10 minutes
- Embedding model: 1-2 minutes
- OpenSearch Docker: 2-3 minutes
- Total: 10-15 minutes
The Multi-Modal Academic Research System leverages a modern, open-source technology stack designed for:
- Accessibility: Free tools, no paywalls
- Quality: State-of-the-art AI and search
- Privacy: Local-first architecture
- Flexibility: Modular, extensible design
- Usability: Simple installation, intuitive interface
Core Stack:
- Python 3.11: Modern language features
- Google Gemini: Free, powerful AI
- OpenSearch: Hybrid search capabilities
- LangChain: AI orchestration
- Gradio: Rapid UI development
- Sentence Transformers: Semantic embeddings
This technology combination enables sophisticated academic research workflows while maintaining simplicity and cost-effectiveness.
For more information: