Technology Stack

Comprehensive documentation of all technologies, libraries, and frameworks used in the Multi-Modal Academic Research System.

Overview
Core Technologies
AI and Machine Learning
Search and Vector Database
Data Collection and Processing
Web Framework and UI
Supporting Libraries
Development Tools
Version Compatibility
Architecture Decisions
Alternative Technologies Considered

Overview

The Multi-Modal Academic Research System is built on a modern Python stack, leveraging state-of-the-art AI models, vector search capabilities, and a user-friendly web interface. The technology choices prioritize:

Free and Open Source: Minimize costs, maximize accessibility
Local-First: Data and processing remain on your machine
Modern AI: Latest language models and embedding techniques
Scalability: Capable of handling thousands of documents
Ease of Use: Simple installation and intuitive interface

Core Technologies

Python 3.9+

Version: 3.9+ (3.11 recommended)

Why Python:

Rich ecosystem for AI/ML and data processing
Extensive libraries for academic data collection
Strong community support
Easy integration with modern AI APIs
Excellent for rapid prototyping and development

Why Python 3.9+:

Type hints support (PEP 585, 604)
Dictionary merge operators
String methods improvements
Required for modern dependency versions

Official: https://www.python.org/

AI and Machine Learning

1. Google Gemini (google-generativeai 0.7.2)

Purpose: AI-powered text generation and vision analysis

Why Gemini:

Free tier available (no credit card required)
State-of-the-art multimodal capabilities
Excellent for academic content analysis
Vision API for diagram understanding
Competitive with GPT-4 in many tasks
Lower latency than other commercial APIs

Models Used:

Gemini 1.5 Pro (gemini-1.5-pro-latest):
- Research query responses
- Text synthesis and analysis
- Citation generation
- Context window: 1M tokens
Gemini 1.5 Flash (gemini-1.5-flash):
- PDF diagram analysis
- Image description
- Faster, lower cost for vision tasks

Alternatives Considered:

OpenAI GPT-4: Higher cost, requires payment
Anthropic Claude: Limited free tier
Local models (LLaMA): Higher compute requirements

Official: https://ai.google.dev/

2. LangChain (0.2.16)

Purpose: AI orchestration and prompt management framework

Why LangChain:

Industry-standard for LLM applications
Built-in memory management for conversations
Excellent prompt templating
Easy integration with multiple LLM providers
Strong community and documentation
Modular architecture for extensibility

Key Features Used:

ChatGoogleGenerativeAI: Gemini integration
ConversationBufferWindowMemory: Conversation tracking
PromptTemplate: Structured prompts for research queries
Chain composition for multi-step reasoning

LangChain Google GenAI (1.0.10):

Official LangChain integration for Google's AI
Seamless Gemini model support
Streaming responses
Function calling support

Alternatives Considered:

Direct API calls: More work, less abstraction
LlamaIndex: Different focus (more on indexing)
Haystack: Less mature Gemini integration

Official: https://python.langchain.com/

3. Sentence Transformers (2.7.0)

Purpose: Generate semantic embeddings for vector search

Why Sentence Transformers:

State-of-the-art semantic similarity models
Fast inference on CPU
Excellent for retrieval tasks
Easy to use and well-documented
Large collection of pre-trained models
Active development and updates

Model Used: all-MiniLM-L6-v2

Dimensions: 384
Speed: ~14,000 sentences/second (CPU)
Performance: Excellent quality/speed tradeoff
Size: ~90MB download
Best for: Semantic search in medium-large corpora

Why all-MiniLM-L6-v2:

Optimal balance of speed and quality
Small embedding size (384d) = lower storage
Fast retrieval with k-NN search
Good performance on academic text
Widely used and battle-tested

Alternative Models Available:

all-mpnet-base-v2: Higher quality (768d), slower
all-MiniLM-L12-v2: Similar size, slightly better quality
multi-qa-mpnet-base-dot-v1: Optimized for Q&A tasks

Official: https://www.sbert.net/

Search and Vector Database

OpenSearch (opensearch-py 2.4.0)

Purpose: Vector database, full-text search, and document indexing

Why OpenSearch:

Open source: Fork of Elasticsearch, Apache 2.0 license
Free to use: No licensing costs
Hybrid search: Combines keyword (BM25) and vector (k-NN) search
Scalable: Handles millions of documents
Feature-rich: Aggregations, filtering, faceting
Local deployment: Runs on your machine, full control
Active development: AWS-backed, regular updates

Key Features Used:

k-NN Vector Search:
- 384-dimensional embeddings
- Cosine similarity for semantic matching
- Fast approximate nearest neighbors
Full-Text Search:
- BM25 ranking algorithm
- Multi-field queries
- Field boosting (title^2)
Hybrid Search:
- Combines keyword and vector results
- Best of both worlds: precise and semantic
- Configurable weighting
Index Management:
- Custom mappings for multi-modal content
- Nested objects for citations
- Keyword arrays for metadata

Schema Design:

{
    'content_type': 'keyword',      # paper/video/podcast
    'title': 'text',                # Searchable title
    'abstract': 'text',             # Paper abstract
    'content': 'text',              # Main content
    'authors': 'keyword',           # Author list
    'publication_date': 'date',     # Temporal filtering
    'embedding': 'knn_vector',      # 384-dim semantic vector
    'diagram_descriptions': 'text', # From Gemini Vision
    'key_concepts': 'keyword',      # Extracted concepts
    'citations': 'nested'           # Citation tracking
}

Deployment: Docker container for local development

docker run -p 9200:9200 -e "discovery.type=single-node" \
  opensearchproject/opensearch:latest

Alternatives Considered:

Elasticsearch: Proprietary license, OpenSearch is the open fork
Pinecone: Cloud-only, costs money, less control
Weaviate: Good but more complex setup
Qdrant: Excellent but newer, smaller ecosystem
ChromaDB: Included but not primary (simpler use cases)
FAISS: No full-text search, just vectors
Milvus: More complex, higher overhead

Why Not ChromaDB:

Included in dependencies (chromadb==0.4.20)
Could be used for simpler deployments
OpenSearch chosen for hybrid search capabilities
ChromaDB better for pure vector search

Official: https://opensearch.org/

Data Collection and Processing

Academic Data Sources

1. ArXiv (arxiv 2.0.0)

Purpose: Collect papers from ArXiv preprint server

Why ArXiv:

Largest open-access preprint repository
Covers CS, physics, math, and more
Free API, no authentication required
Rich metadata (authors, abstract, categories)
Direct PDF download links

Usage:

from arxiv import Search, Client
client.search(query="machine learning", max_results=10)

Official: https://arxiv.org/

2. Scholarly (1.7.11)

Purpose: Collect papers from Google Scholar and Semantic Scholar

Why Scholarly:

Access to Google Scholar's vast database
Semantic Scholar integration
Citation information
No API key required (uses scraping)
Author profiles and metrics

Limitations:

Rate limiting required (avoid IP bans)
May need proxies for heavy use
Less reliable than official APIs

Usage:

from scholarly import scholarly
search = scholarly.search_pubs('deep learning')

Official: https://github.com/scholarly-python-package/scholarly

3. YouTube Transcript API (youtube-transcript-api 0.6.1)

Purpose: Extract transcripts from educational YouTube videos

Why YouTube Transcript API:

Free and no API key required
Accesses official YouTube transcripts
Supports multiple languages
Auto-generated and manual captions
Timestamp information

Usage:

from youtube_transcript_api import YouTubeTranscriptApi
transcript = YouTubeTranscriptApi.get_transcript(video_id)

Alternatives:

YouTube Data API v3: Requires API key, quota limits
pytube (0.15.0): Video metadata extraction
yt-dlp (2024.3.10): Alternative downloader

Official: https://github.com/jdepoix/youtube-transcript-api

4. Feedparser (6.0.10)

Purpose: Parse RSS/Atom feeds for podcast collection

Why Feedparser:

Universal feed parser (RSS, Atom, RDF)
Robust and battle-tested
Handles malformed feeds gracefully
Extract episode metadata
Widely used and maintained

Usage:

import feedparser
feed = feedparser.parse('https://podcast.rss.url')

Official: https://feedparser.readthedocs.io/

Document Processing

1. PyPDF (pypdf 3.17.0)

Purpose: Extract text and metadata from PDF files

Why PyPDF:

Pure Python, no external dependencies
Actively maintained (formerly PyPDF2)
Good text extraction
PDF metadata parsing
Page-by-page processing

Limitations:

May struggle with complex layouts
Scanned PDFs need OCR

Official: https://pypdf.readthedocs.io/

2. PyMuPDF (1.23.8)

Purpose: Advanced PDF processing and image extraction

Why PyMuPDF (fitz):

Fast C-based implementation
Excellent image extraction
Better handling of complex PDFs
Page rendering capabilities
Extract diagrams and figures

Why Both PyPDF and PyMuPDF:

Fallback mechanism: Try PyMuPDF first, fallback to PyPDF
Different PDFs work better with different libraries
Maximize successful text extraction

Usage:

import fitz  # PyMuPDF
doc = fitz.open("paper.pdf")
images = doc[page_num].get_images()

Official: https://pymupdf.readthedocs.io/

3. Whisper (1.1.10)

Purpose: Audio transcription for videos and podcasts

Why Whisper:

State-of-the-art speech recognition
Multilingual support (99 languages)
Open source from OpenAI
Works offline (local processing)
High accuracy on academic content

Models:

tiny, base: Fast, lower accuracy
small, medium: Balanced
large: Best accuracy, slower

Note: Currently included but not fully integrated. YouTube uses existing transcripts.

Official: https://github.com/openai/whisper

Web Framework and UI

Gradio (4.8.0)

Purpose: Web-based user interface

Why Gradio:

Rapid development: Build UIs in minutes
Beautiful defaults: Professional look out of the box
Multi-tab support: Organize complex applications
Real-time updates: Live feedback during processing
Share functionality: Create public URLs instantly
Python-native: No JavaScript required
FastAPI integration: Built on modern async framework

Key Features Used:

Tabbed interface (Research, Data Collection, Citations, Settings)
Text input/output components
Buttons and interactive elements
Real-time status updates
File upload capabilities
Markdown rendering for responses

UI Structure:

with gr.Blocks() as app:
    with gr.Tab("Research"):
        # Query interface
    with gr.Tab("Data Collection"):
        # Collection controls
    with gr.Tab("Citation Manager"):
        # Citation export
    with gr.Tab("Settings"):
        # System configuration

Why Gradio over alternatives:

Streamlit: Less flexible layout, page reloads
Dash: More complex, steeper learning curve
Flask/FastAPI: Would require frontend development
Jupyter: Not ideal for production deployment

FastAPI (0.109.0) + Uvicorn (0.27.0):

Gradio runs on FastAPI (modern async framework)
Uvicorn is the ASGI server
High performance, async support
Easy to extend with custom API endpoints

Official: https://gradio.app/

Supporting Libraries

1. Python-dotenv (1.0.0)

Purpose: Environment variable management

Why python-dotenv:

Load .env files automatically
Keep secrets out of code
Easy configuration management
Development/production separation

Usage:

from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('GEMINI_API_KEY')

2. Requests (2.31.0)

Purpose: HTTP client for API calls

Why Requests:

De facto standard for HTTP in Python
Simple, elegant API
Robust error handling
Session management
Wide adoption and support

3. BeautifulSoup4 (4.12.2)

Purpose: HTML and XML parsing

Why BeautifulSoup:

Intuitive API for parsing HTML
Robust handling of malformed HTML
Multiple parser backends
Used for web scraping when APIs unavailable

4. Pandas (2.1.3)

Purpose: Data manipulation and analysis

Why Pandas:

DataFrame structure for tabular data
Easy data transformation
CSV export capabilities
Integration with other tools
Citation export and management

5. NumPy (>=1.25.2, <2.0)

Purpose: Numerical computing foundation

Why NumPy:

Dependency for many libraries (Pandas, Sentence Transformers)
Array operations for embeddings
Mathematical computations
Version <2.0 for compatibility

6. tqdm (4.66.1)

Purpose: Progress bars for long operations

Why tqdm:

Visual feedback during collection/processing
Works in terminal and notebooks
Minimal overhead
Easy integration

7. Pydub (0.25.1)

Purpose: Audio processing and manipulation

Why Pydub:

Simple API for audio operations
Format conversion
Audio segmentation
Required for Whisper integration

8. yt-dlp (2024.3.10)

Purpose: YouTube video metadata extraction

Why yt-dlp:

Active fork of youtube-dl
Regular updates for YouTube changes
Robust error handling
Alternative to pytube
More reliable than pytube

Development Tools

Version Control

Git: Source control and collaboration
.gitignore: Excludes venv, .env, data, logs

Virtual Environment

venv: Isolate project dependencies
Platform-independent
Built into Python 3.3+

Package Management

pip: Python package installer
requirements.txt: Dependency specification

Version Compatibility

Python Version Requirements

Minimum: Python 3.9 Recommended: Python 3.11

Why 3.9+:

Type hint improvements (PEP 585, 604)
Dictionary merge operators (|)
String methods (removeprefix, removesuffix)
Required for modern packages

Why 3.11 recommended:

10-60% faster than 3.10
Better error messages
Improved typing features
Current stable release

Critical Dependencies

Must be compatible:

opensearch-py with OpenSearch server version
langchain with langchain-google-genai
numpy <2.0 (compatibility constraint)
sentence-transformers with PyTorch (auto-installed)

Breaking Changes to Watch

NumPy 2.0:

Many libraries not yet compatible
Pinned to <2.0 in requirements
Will update when ecosystem catches up

LangChain:

Rapid development, frequent updates
Pin to 0.2.x for stability
Check migration guides for major versions

OpenSearch:

Compatible with 1.x and 2.x servers
Client version should match server major version

Architecture Decisions

Why Hybrid Search?

Decision: Combine BM25 keyword search with k-NN vector search

Rationale:

Keyword search (BM25): Precise matching, handles specific terms
Vector search (k-NN): Semantic similarity, understands meaning
Together: Best of both worlds

Example:

Query: "neural network training"
BM25 finds: Papers with exact phrase
k-NN finds: Papers about "model optimization", "gradient descent"
Hybrid: Comprehensive results covering all aspects

Why Local-First?

Decision: Run everything locally (OpenSearch, embeddings, processing)

Rationale:

Privacy: Your data stays on your machine
Cost: No cloud fees, only API costs for Gemini
Control: Full control over infrastructure
Offline: Works without constant internet
Speed: No network latency for search

Cloud usage limited to:

Gemini API calls (can be minimized)
Public Gradio link (optional)

Why Gemini over GPT?

Decision: Use Google Gemini as primary LLM

Rationale:

Free tier: No credit card, generous limits
Multimodal: Native vision support
Quality: Competitive with GPT-4
Context: 1M token window
Speed: Fast response times
API: Simple, well-documented

Trade-offs:

GPT-4 may be slightly better for some tasks
OpenAI ecosystem is more mature
But cost and accessibility favor Gemini

Why Gradio over Custom Frontend?

Decision: Use Gradio for UI instead of React/Vue

Rationale:

Speed: Build in hours, not days
Python-only: No JavaScript needed
Good enough: Professional appearance
Share feature: Instant public URLs
Maintenance: Less code to maintain

Trade-offs:

Less customization than custom frontend
Tied to Gradio's update cycle
Limited styling options
But: Faster development and deployment

Alternative Technologies Considered

Vector Databases

Evaluated:

Pinecone: Cloud-only, paid service
Weaviate: More setup complexity
Qdrant: Excellent but newer
Milvus: Higher operational overhead
FAISS: No full-text search
ChromaDB: Simpler but less features

Chosen: OpenSearch (hybrid search, open source, free)

LLM Providers

Evaluated:

OpenAI GPT-4: Better quality, higher cost, requires payment
Anthropic Claude: Good quality, limited free tier
Cohere: Commercial focus, pricing model
Local models (LLaMA, Mistral): No API costs, high compute requirements

Chosen: Google Gemini (free, multimodal, good quality)

Orchestration Frameworks

Evaluated:

Direct API calls: More control, more work
LlamaIndex: Better for pure indexing
Haystack: Less mature Gemini support

Chosen: LangChain (mature, flexible, good Gemini integration)

UI Frameworks

Evaluated:

Streamlit: Popular but less flexible
Dash: Plotly-based, more complex
Flask + React: Full control, more development
Jupyter: Not production-ready

Chosen: Gradio (fast development, good features)

Technology Roadmap

Potential Future Additions

Enhanced Search:

Reranking models (e.g., cross-encoders)
Query expansion techniques
Multi-vector search

More Data Sources:

PubMed Central integration
Springer/Nature APIs (if available)
Podcast platforms (Spotify, Apple)

Advanced AI:

Local LLM support (LLaMA, Mistral)
Fine-tuned models for academic text
Multi-agent systems

Performance:

GPU acceleration for embeddings
Distributed OpenSearch
Caching layer (Redis)

Features:

Document summarization
Automatic concept extraction
Knowledge graph generation
Collaborative features

Dependency Tree

Multi-Modal Academic Research System
│
├── AI & ML
│   ├── google-generativeai (Gemini API)
│   ├── google-genai (newer SDK)
│   ├── langchain (orchestration)
│   ├── langchain-google-genai (integration)
│   └── sentence-transformers (embeddings)
│       └── PyTorch (auto-installed)
│
├── Search & Database
│   ├── opensearch-py (vector DB)
│   └── chromadb (alternative)
│
├── Data Collection
│   ├── arxiv (papers)
│   ├── scholarly (Google Scholar)
│   ├── youtube-transcript-api (YouTube)
│   ├── pytube (video metadata)
│   ├── yt-dlp (alternative downloader)
│   └── feedparser (podcasts)
│
├── Document Processing
│   ├── pypdf (PDF text)
│   ├── PyMuPDF (PDF images)
│   ├── whisper (audio transcription)
│   └── pydub (audio processing)
│
├── Web & UI
│   ├── gradio (UI framework)
│   ├── fastapi (backend)
│   └── uvicorn (ASGI server)
│
└── Supporting
    ├── python-dotenv (config)
    ├── requests (HTTP)
    ├── beautifulsoup4 (parsing)
    ├── pandas (data manipulation)
    ├── numpy (numerical computing)
    └── tqdm (progress bars)

Installation Impact

Download Sizes (Approximate)

Python Packages: ~2GB total

PyTorch (sentence-transformers dependency): ~800MB
Gradio + FastAPI: ~200MB
OpenSearch client: ~50MB
Other packages: ~950MB combined

Models (downloaded on first use):

all-MiniLM-L6-v2 embeddings: ~90MB
Whisper models (optional):
- tiny: ~75MB
- base: ~150MB
- small: ~500MB
- medium: ~1.5GB
- large: ~3GB

Docker Images:

OpenSearch: ~1.2GB

Total First Install: ~3-4GB (without Whisper models)

Installation Time

On typical broadband (50 Mbps):

Python packages: 5-10 minutes
Embedding model: 1-2 minutes
OpenSearch Docker: 2-3 minutes
Total: 10-15 minutes

Summary

The Multi-Modal Academic Research System leverages a modern, open-source technology stack designed for:

Accessibility: Free tools, no paywalls
Quality: State-of-the-art AI and search
Privacy: Local-first architecture
Flexibility: Modular, extensible design
Usability: Simple installation, intuitive interface

Core Stack:

Python 3.11: Modern language features
Google Gemini: Free, powerful AI
OpenSearch: Hybrid search capabilities
LangChain: AI orchestration
Gradio: Rapid UI development
Sentence Transformers: Semantic embeddings

This technology combination enables sophisticated academic research workflows while maintaining simplicity and cost-effectiveness.

For more information:

FilesExpand file tree

technology-stack.md

Latest commit

History

technology-stack.md

File metadata and controls

Technology Stack

Table of Contents

Overview

Core Technologies

Python 3.9+

AI and Machine Learning

1. Google Gemini (google-generativeai 0.7.2)

2. LangChain (0.2.16)

3. Sentence Transformers (2.7.0)

Search and Vector Database

OpenSearch (opensearch-py 2.4.0)

Data Collection and Processing

Academic Data Sources

1. ArXiv (arxiv 2.0.0)

2. Scholarly (1.7.11)

3. YouTube Transcript API (youtube-transcript-api 0.6.1)

4. Feedparser (6.0.10)

Document Processing

1. PyPDF (pypdf 3.17.0)

2. PyMuPDF (1.23.8)

3. Whisper (1.1.10)

Web Framework and UI

Gradio (4.8.0)

Supporting Libraries

1. Python-dotenv (1.0.0)

2. Requests (2.31.0)

3. BeautifulSoup4 (4.12.2)

4. Pandas (2.1.3)

5. NumPy (>=1.25.2, <2.0)

6. tqdm (4.66.1)

7. Pydub (0.25.1)

8. yt-dlp (2024.3.10)

Development Tools

Version Control

Virtual Environment

Package Management

Version Compatibility

Python Version Requirements

Critical Dependencies

Breaking Changes to Watch

Architecture Decisions

Why Hybrid Search?

Why Local-First?

Why Gemini over GPT?

Why Gradio over Custom Frontend?

Alternative Technologies Considered

Vector Databases

LLM Providers

Orchestration Frameworks

UI Frameworks

Technology Roadmap

Potential Future Additions

Dependency Tree

Installation Impact

Download Sizes (Approximate)

Installation Time

Summary