Skip to content

liam-langchain/langgraph-vs-openai-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LangChain AI Framework Performance Benchmarks

Comprehensive performance benchmarks comparing LangChain ecosystem frameworks (LangChain, LangGraph) vs raw OpenAI API implementations across different use cases.

Translation Benchmark Results

LangChain and LangGraph deliver exceptional value with enterprise-grade capabilities

  • LangChain: Only 17.2% overhead for complete enterprise ecosystem
  • LangGraph: Only 28.2% overhead for revolutionary workflow orchestration
  • Outstanding framework value far outweighs modest performance costs
  • Production-ready architecture vs basic API calls
  • See translation_benchmark/report.md for comprehensive analysis

RAG Benchmark Results

LangGraph delivers superior RAG performance with enterprise architecture

  • 20.9% faster P99 latency (7.024s vs 8.494s - LangGraph wins!)
  • Better mean latency (3.322s vs 3.364s - LangGraph is faster)
  • Superior framework capabilities for complex workflows and state management
  • See rag_benchmark/report.md for detailed RAG analysis

Quick Start

Translation Benchmark (OpenAI Raw vs LangChain vs LangGraph)

cd translation_benchmark
export OPENAI_API_KEY="your-openai-api-key"
pip install langchain langchain-openai langgraph openai numpy matplotlib
python translation_benchmark.py

RAG Benchmark (LangGraph vs OpenAI API)

cd rag_benchmark  
export OPENAI_API_KEY="your-openai-api-key"
pip install langchain langchain-openai langgraph faiss-cpu numpy pandas matplotlib
python benchmark.py

What These Benchmarks Test

Translation Benchmark

  • Three implementations: OpenAI Raw, LangChain wrapper, LangGraph workflow
  • Fair comparison: Identical prompts, models, streaming, processing
  • 1,500 total tests: 500 runs per implementation
  • Measures pure overhead: Architectural costs of each abstraction layer

RAG Benchmark

  • Two implementations: LangGraph streaming RAG vs direct OpenAI API streaming RAG
  • 1,000 total tests: 500 runs per implementation (10 queries × 50 runs each)
  • Comprehensive metrics: P99 latency, mean/median latency, consistency
  • Fair comparison: Identical RAG logic, prompts, models, and knowledge base

Repository Structure

├── translation_benchmark/          # Translation performance comparison
│   ├── translation_openai.py       # Direct OpenAI API implementation
│   ├── translation_langchain.py    # LangChain wrapper implementation  
│   ├── translation_langgraph.py    # LangGraph workflow implementation
│   ├── translation_benchmark.py    # Benchmark runner and analysis
│   └── report.md                   # Comprehensive translation analysis
│
├── rag_benchmark/                  # RAG performance comparison
│   ├── langgraph_app.py           # LangGraph streaming RAG implementation
│   ├── openai_app.py              # OpenAI API streaming RAG implementation
│   ├── benchmark.py               # RAG benchmark runner
│   ├── knowledge_base.py          # Sample documents and test queries
│   └── report.md                  # Comprehensive RAG analysis
│
└── CLAUDE.md                      # Project documentation and instructions

Technical Implementation

Translation Benchmark (Fair Comparison Ensured)

All three implementations use identical configurations:

  • Prompts: f"Translate to {target_language}: {text}"
  • Model: gpt-4o-mini with temperature=0.3
  • Streaming: Token-by-token processing
  • Processing: result.strip() only
  • No confounding variables: Pure architectural overhead measurement

RAG Benchmark (Fair Comparison Ensured)

Both implementations use identical logic:

  • RAG Retrieval: Vector similarity search with top-3 documents
  • Embedding Model: text-embedding-3-small
  • LLM Model: gpt-4o-mini with temperature=0
  • Prompts: Identical system message and user prompt format
  • Knowledge Base: Same 10 AI/technology documents

##Benchmark Output

Each benchmark generates:

  • Console Output: Real-time progress and comprehensive statistical summary
  • Performance Charts: Visual comparisons with latency distributions and percentiles
  • Detailed Reports: Complete analysis with recommendations (see report.md files)
  • Raw Data: All timing measurements for further analysis

Key Findings Summary

Translation Benchmark (1,500 tests)

Implementation Mean Latency Overhead Value Proposition
OpenAI Raw 2.434s 0% (baseline) Fastest but lacks enterprise features
LangChain 2.852s 17.2% Enterprise ecosystem
LangGraph 3.121s 28.2% Eomplete workflow framework

RAG Benchmark (1,000 tests)

Implementation P99 Latency Mean Latency Winner
LangGraph 7.024s 3.322s LangGraph (superior performance + enterprise features)
OpenAI API 8.494s 3.364s Basic API calls only

Strong Recommendations

For Translation Applications

  • LangChain: Recommended for most production applications - exceptional 17.2% cost for enterprise value
  • LangGraph: Highly recommended for complex applications - revolutionary workflow capabilities for 28.2% cost
  • OpenAI Raw: Recommended for low latency applications - raw performance is critical

For RAG Applications

  • LangGraph: Clear winner - superior performance AND enterprise architecture
  • OpenAI API Direct: Only for simple prototypes with minimal requirements

Requirements

  • Python: 3.9+
  • OpenAI API Key: Required for all benchmarks
  • Dependencies: See individual benchmark folders for specific requirements

These benchmarks provide fair, comprehensive comparisons demonstrating the exceptional value of LangChain ecosystem frameworks for production applications.

About

Streaming RAG benchmark results: LangGraph outperforms OpenAI API with 20.9% faster P99 latency (7.024s vs 8.494s). Complete analysis of 1,000 test runs with performance visualizations.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages