A FINOS Labs initiative for building a taxonomy, datasets, and tooling for system-level evaluation of AI (including agentic workflows) in financial services.
AI systems are non-deterministic and financial tasks rarely have a single “correct” answer. General-purpose, model-only benchmarks miss domain-specific correctness, operational risk, and compliance needs.
This project anchors evaluations in financial use cases, with a taxonomy that links:
Use Cases → Risks → Metrics
This linkage bridges technical benchmarking with business and regulatory reality, supporting trust, comparability, and safer deployment; the goal is not a generic leaderboard.
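The Use Cases → Risks → Metrics linkage can be sketched as plain data structures. This is a hypothetical illustration only: the class names, example use case, risk, metric, and threshold below are assumptions for demonstration, not part of the published taxonomy.

```python
# Illustrative sketch of the Use Cases -> Risks -> Metrics linkage.
# All names, fields, and values are hypothetical examples.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str         # e.g. "faithfulness"
    threshold: float  # minimum acceptable score in [0, 1]


@dataclass
class Risk:
    name: str                               # e.g. "hallucinated trade terms"
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class UseCase:
    name: str                               # e.g. "post-trade confirmation drafting"
    risks: list[Risk] = field(default_factory=list)


# A use case links to the risks it carries, and each risk to the metrics
# that would evidence the risk is controlled.
confirmation_drafting = UseCase(
    name="post-trade confirmation drafting",
    risks=[
        Risk(
            name="hallucinated trade terms",
            metrics=[Metric(name="faithfulness", threshold=0.95)],
        )
    ],
)

# Walking the linkage yields the metric set an evaluation harness must cover.
required = [
    (risk.name, metric.name, metric.threshold)
    for risk in confirmation_drafting.risks
    for metric in risk.metrics
]
```

Traversing from a use case to its metrics is what lets an evaluation plan be derived from business risk rather than from whatever a generic benchmark happens to measure.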
This work is part of the broader FINOS AI agenda: it supplies the evaluation and benchmarking layer that turns governance intent into measurable evidence.
- AI Governance Framework (AIGF): What to govern (policies, risks, guardrails).
- Common Architecture Language Model (CALM) and related specs: Machine-readable expression of those requirements where applicable.
- This framework: Datasets, metrics, reference architectures, and methods (including LLM-as-a-judge patterns) to test whether systems and agents behave acceptably in real workflows.
- FINOS Common Cloud Controls (CCC) and firm controls: Infrastructure and organizational requirements that sit alongside technical evaluation.
Together, these pieces support a transparent, finance-aware path from requirements to verified behavior—not a model factory or a substitute for internal compliance programs.
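To make the LLM-as-a-judge pattern mentioned above concrete, here is a minimal sketch of the two halves that sit around the judge model call: prompt assembly and defensive verdict parsing. The rubric wording and JSON verdict format are assumptions for illustration; the actual judge model invocation is deliberately left abstract.

```python
# Illustrative LLM-as-a-judge scaffolding. The verdict schema is an
# assumption; the judge model call itself is out of scope here.
import json


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Assemble a grading prompt asking the judge model for a JSON verdict."""
    return (
        "You are grading an AI system's answer for a financial task.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON: {"score": <0-1>, "reason": "<short explanation>"}'
    )


def parse_verdict(raw: str) -> tuple[float, str]:
    """Parse the judge's JSON reply; fail closed (score 0.0) on bad output."""
    try:
        verdict = json.loads(raw)
        return float(verdict["score"]), str(verdict.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        return 0.0, "unparseable judge output"


# A well-formed judge reply parses to a usable score...
score, reason = parse_verdict('{"score": 0.8, "reason": "minor omission"}')
# ...while malformed output fails closed rather than passing silently.
bad_score, _ = parse_verdict("I think it is fine")
```

Failing closed on malformed judge output matters in regulated settings: an unparseable verdict should never be mistaken for a pass.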
It is:
- a taxonomy and methodology hub;
- a place for evaluation datasets and synthetic-data strategies;
- reference architectures and implementation patterns for RAG and agentic stacks;
- task benchmarks and guardrail concepts;
- adapters/patterns that map open evaluation tooling to FSI contexts.

It is not:
- an orchestration platform or LLM factory;
- a generic public leaderboard;
- a replacement for firm-specific policies and sign-off.
Evaluation here emphasizes whole-system behavior (retrieval, tools, routing, traces, policies), not final-token accuracy alone.
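Whole-system evaluation can be sketched as checks over the full execution trace rather than only the final answer. Everything below is a hypothetical illustration: the step kinds, the allow-listed tool names, and the specific checks are assumptions, not framework requirements.

```python
# Illustrative "glass-box" evaluation: properties are checked on the trace
# (retrieval, tool calls) as well as on the final answer. Step kinds and
# the tool allow-list are hypothetical.
from dataclasses import dataclass


@dataclass
class Step:
    kind: str    # e.g. "retrieval", "tool_call", "llm"
    detail: dict


def check_trace(trace: list[Step], final_answer: str) -> dict[str, bool]:
    """Evaluate system-level properties alongside answer-level ones."""
    return {
        # Did the system retrieve context before answering (grounding)?
        "retrieved_context": any(s.kind == "retrieval" for s in trace),
        # Did it stay within an allow-listed tool set (policy check)?
        "tools_allowed": all(
            s.detail.get("tool") in {"pricing_api", "calendar"}
            for s in trace
            if s.kind == "tool_call"
        ),
        # Final-answer check: one signal among several, not the whole story.
        "answer_nonempty": bool(final_answer.strip()),
    }


trace = [
    Step("retrieval", {"docs": 3}),
    Step("tool_call", {"tool": "pricing_api"}),
]
results = check_trace(trace, "Settlement is T+2.")
```

A system that produces the right final tokens but, say, skipped retrieval or called an unapproved tool would still fail this style of check, which is the point of trace-oriented evaluation.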
Community use cases are the main intake for prioritizing agents, datasets, and evaluation design. The structured form captures description, business value, risks, proposed metrics, and likely system components (LLM, RAG, tools, etc.).
Suggested path:
- Propose — Open a Financial Agent Use Case issue (form source). This creates a shared, reviewable record under the use-case workflow.
- Prioritize and align — Maintainers and contributors discuss and prioritize proposals against framework milestones (taxonomy, metrics, reference stacks, synthetic-data pipelines, RAG/agentic evaluation tracks). Open issues in this repository track those themes.
- Implement in dedicated repositories — For agreed priorities, reference agent implementations and evaluation harnesses (datasets, scenarios, thresholds, reproducible runs) typically live in separate FINOS Labs repositories, implemented in line with approved reference architectures and patterns published or agreed here. That keeps this repo focused on taxonomy, standards, and cross-cutting assets while allowing each agent/eval program to iterate quickly with clear scope.
- Lifecycle — New or incubated work generally enters the ecosystem under the updated FINOS Project Lifecycle:
  - Labs for exploratory, neutral collaboration with clear baseline expectations;
  - optional Forming for time-bound setup;
  - Incubating for governance, roadmap discipline, and repeatable open-source practices;
  - Graduated for high maturity, adoption, and sustainability;
  - Archived when maintenance ends but history remains valuable.

  Security and operational expectations scale with stage (aligned with OpenSSF baselines as described in that post). Broader FINOS project proposals and stage transitions are coordinated through the foundation’s community processes (including issue-based workflows described on FINOS).
For each financial use case, the framework aims to provide:
- Test datasets and synthetic data generation approaches
- Reference architectures and implementation strategies for evaluation
- Metrics, thresholds, and evaluation guidelines
- Monitoring and “glass-box” / trajectory-oriented evaluation guidance where agentic systems are in scope
Associated assets may include open datasets (for example CDM-oriented synthetic scenarios), patterns for observability-backed evaluation, and integration notes for open evaluators adapted to financial tasks.
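The "metrics, thresholds, and evaluation guidelines" item above implies a gating step: aggregating per-metric scores against agreed thresholds into a single pass/fail decision. The sketch below illustrates that idea; the metric names and threshold values are assumptions, not published guidance.

```python
# Illustrative threshold gating: a run passes only if every agreed metric
# meets its minimum. Metric names and thresholds are hypothetical.
def gate(
    scores: dict[str, float], thresholds: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, failing_metrics). A missing score fails its metric."""
    failing = [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failing, failing)


# Example: faithfulness clears its bar, but tool-policy compliance does not,
# so the run as a whole is gated out.
thresholds = {"faithfulness": 0.95, "tool_policy_compliance": 1.0}
passed, failing = gate(
    {"faithfulness": 0.97, "tool_policy_compliance": 0.9},
    thresholds,
)
```

Treating a missing score as a failure keeps the gate conservative: a metric that was never measured cannot silently count as passing.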
High-level phases (see workshop summary in the PDF below and milestones as GitHub issues):
- Gather workshop and techsprint artefacts (Sept 2025) → PDF summary
- Literature review, infrastructure, repository launch; taxonomy and milestone definitions advanced on GitHub (2025–2026)
- Use-case intake via the issue template; prioritized work feeding dedicated Labs implementation and evaluation repos
- Template repositories and reference examples aligned to agreed architectures
- Pilot engagement with financial institutions
- Expand shared taxonomy and reusable assets across industry participants
The evaluation stream is on the agenda of the FINOS AI Governance Framework Training Workshop at OSFF Toronto (13 April 2026, Toronto). After sessions on AIGF leader training and the AI Reference Architecture Library (including how that library intersects with AIGF and CALM), the workshop includes “Beyond the Black Box: Operationalizing AgentOps and glass-box evaluations for financial AI”, a practical segment on moving from traditional MLOps to AgentOps and on mapping real-world financial use cases to rigorous, quantitative evaluation metrics, grounded in ongoing work in this FINOS AI Evaluation & Benchmarking framework. That session is a primary opportunity to recruit contributors and prioritize use cases with governance, risk, and engineering practitioners in the room.
The aim is to showcase a selection of those prioritized use cases and evaluation progress at Open Source in Finance Forum London (25 June 2026). If you plan to propose or champion a use case, filing a Financial Agent Use Case issue ahead of Toronto helps the community prepare for discussion and follow-up.
Each milestone in the diagram below is tracked as a GitHub issue in this repository (search issues for “milestone definition” and use-case proposals for active agent threads).
See CONTRIBUTING.md and the FINOS Community Code of Conduct. Use the Financial Agent Use Case template for new evaluation scenarios; use the other issue templates for bugs, features, and support questions.
Copyright 2025 FINOS
Distributed under the Apache License, Version 2.0.
SPDX-License-Identifier: Apache-2.0