Overview
This project builds a unified evaluation pipeline for multi-tool LLM agents. It integrates existing open-source evaluation frameworks and adds a custom layer for trace-based analysis and cross-run comparison. The goal is to capture a broad set of evaluation signals per agent run: correctness, grounding quality, hallucination risk, tool efficiency, cost, and latency.
Design
1. LLM Agent
A single multi-tool agent (AutoGen / LangChain) serves as the evaluation target. It is equipped with tools for RAG (document retrieval), Python/pandas execution, financial data queries, and plotting, chosen specifically to surface a wide range of evaluable behaviors.
2. Trace Logging System
Full execution traces are recorded per task, capturing the signals the evaluation stages need:
- Tool calls with inputs and outputs
- Intermediate reasoning steps
- Final responses
- Per-step token usage
- Per-step latency
Traces are stored in a structured JSONL/SQLite format for reliable querying and cross-run comparison.
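A minimal sketch of what such a trace store could look like, using only the Python standard library. The `TraceStep` schema, the `steps` table, and the helper names are illustrative assumptions, not part of the design above; the point is only to show one record per step appended as JSONL and then loaded into SQLite for cross-run queries.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TraceStep:
    """One step of an agent run (hypothetical schema)."""
    run_id: str
    step: int
    kind: str              # "tool_call", "reasoning", or "final_response"
    tool: Optional[str]    # tool name for tool calls, otherwise None
    input: str
    output: str
    tokens: int            # tokens consumed by this step
    latency_s: float       # wall-clock seconds spent in this step


def append_jsonl(path: str, step: TraceStep) -> None:
    """Append one step to the trace file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(step)) + "\n")


def load_into_sqlite(jsonl_path: str, conn: sqlite3.Connection) -> None:
    """Copy all JSONL steps into a SQLite table so they can be queried."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "run_id TEXT, step INTEGER, kind TEXT, tool TEXT, "
        "input TEXT, output TEXT, tokens INTEGER, latency_s REAL)"
    )
    with open(jsonl_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    conn.executemany(
        "INSERT INTO steps VALUES (:run_id, :step, :kind, :tool, "
        ":input, :output, :tokens, :latency_s)",
        rows,
    )
    conn.commit()


def run_totals(conn: sqlite3.Connection):
    """Return (run_id, total tokens, total latency) per run, sorted by run_id."""
    cur = conn.execute(
        "SELECT run_id, SUM(tokens), SUM(latency_s) FROM steps "
        "GROUP BY run_id ORDER BY run_id"
    )
    return cur.fetchall()
```

Keeping JSONL as the append-only write path and SQLite as the query path is one possible split; writing directly to SQLite would work as well if concurrent writers are not a concern.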
3. Evaluation Pipeline
The pipeline integrates established open-source evaluation tools alongside custom heuristics to broaden metric coverage:
- RAGAS → grounding, faithfulness, and context relevance scores
- DeepEval → LLM-as-judge correctness and answer quality scoring
- Custom heuristics → tool usage efficiency, redundancy detection, and cost tracking
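The custom heuristics could be sketched roughly as follows. This is an assumption-laden illustration: the dict shapes (`tool`, `input`, `tokens`), the definition of redundancy as an exact repeat of a (tool, input) pair, and the flat per-token price are all hypothetical choices, and the RAGAS and DeepEval stages are deliberately left out.

```python
import json
from collections import Counter


def redundant_calls(tool_calls):
    """Count tool calls that exactly repeat an earlier (tool, input) pair."""
    seen = Counter(
        (c["tool"], json.dumps(c["input"], sort_keys=True)) for c in tool_calls
    )
    # Each distinct pair is allowed once; every further occurrence is redundant.
    return sum(n - 1 for n in seen.values())


def tool_efficiency(tool_calls):
    """Fraction of tool calls that are not exact repeats (1.0 if no calls)."""
    if not tool_calls:
        return 1.0
    return 1.0 - redundant_calls(tool_calls) / len(tool_calls)


def run_cost(steps, usd_per_1k_tokens):
    """Cost of a run at a flat, assumed price per 1,000 tokens."""
    return sum(s["tokens"] for s in steps) / 1000 * usd_per_1k_tokens


calls = [
    {"tool": "rag", "input": {"q": "revenue 2023"}},
    {"tool": "rag", "input": {"q": "revenue 2023"}},  # exact repeat
    {"tool": "python", "input": "df.describe()"},
    {"tool": "rag", "input": {"q": "revenue 2022"}},
]
print(redundant_calls(calls), tool_efficiency(calls))
```

Exact-match redundancy is the simplest definition available; near-duplicate queries (for example, rephrased retrieval questions) would need a similarity threshold that this sketch does not attempt.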
4. Unified Evaluation Layer
A custom aggregation component that maps trace-level steps to evaluation signals along the following dimensions:
- Correctness
- Tool efficiency
- Hallucination risk
- Grounding quality
- Cost
- Latency
This enables per-run and cross-run metric comparison from a single interface.
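One way the aggregation could work is shown below, as a sketch under stated assumptions. `RunReport`, `build_report`, and `compare` are hypothetical names; the sketch assumes that correctness, grounding, hallucination, and tool-efficiency scores have already been computed elsewhere, and that cost and latency are derived from trace steps carrying `tokens` and `latency_s` fields.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "correctness", "tool_efficiency", "hallucination_risk",
    "grounding_quality", "cost", "latency",
)


@dataclass
class RunReport:
    run_id: str
    metrics: dict  # dimension name -> numeric value


def build_report(run_id, steps, scores):
    """Merge trace-derived metrics with externally computed scores.

    steps:  list of dicts with "tokens" and "latency_s" for each trace step
    scores: dict of scores produced by the evaluation pipeline
    """
    metrics = dict(scores)
    metrics["cost"] = sum(s["tokens"] for s in steps)         # tokens as a cost proxy
    metrics["latency"] = sum(s["latency_s"] for s in steps)   # total seconds
    missing = [d for d in DIMENSIONS if d not in metrics]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return RunReport(run_id, metrics)


def compare(a, b):
    """Per-dimension difference (b minus a) between two run reports."""
    return {d: b.metrics[d] - a.metrics[d] for d in DIMENSIONS}
```

Failing loudly on a missing dimension is a deliberate choice in this sketch: a silently absent score would otherwise make two runs look comparable when they are not.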
5. Reliability Scoring System
A composite reliability score is synthesized from all evaluation dimensions, enabling single-number comparison across agent runs while retaining per-dimension breakdowns for diagnosis.
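The design does not specify how the composite is computed. As one possibility, a weighted average over normalized dimensions could look like the sketch below. The equal weights, the requirement that every metric is already scaled to [0, 1], and the inversion of "lower is better" dimensions are all assumptions made for illustration.

```python
# Dimensions where a higher raw value is worse; inverted before weighting.
LOWER_IS_BETTER = {"hallucination_risk", "cost", "latency"}

# Illustrative equal weights; the actual weighting is not given in the design.
DEFAULT_WEIGHTS = {
    "correctness": 1.0,
    "tool_efficiency": 1.0,
    "hallucination_risk": 1.0,
    "grounding_quality": 1.0,
    "cost": 1.0,
    "latency": 1.0,
}


def reliability_score(metrics, weights=DEFAULT_WEIGHTS):
    """Return (composite, breakdown) for metrics already normalized to [0, 1].

    breakdown holds each dimension oriented so that higher is better,
    which keeps the per-dimension view available for diagnosis.
    """
    breakdown = {}
    for name in weights:
        value = metrics[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        breakdown[name] = 1.0 - value if name in LOWER_IS_BETTER else value
    total_weight = sum(weights.values())
    composite = sum(weights[n] * breakdown[n] for n in weights) / total_weight
    return composite, breakdown
```

Returning the breakdown alongside the composite matters because two runs can share the same single number for different reasons, for example one being slow but accurate and the other fast but poorly grounded.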