ICM is a five-component index that measures convergence across multiple epistemic methods -- how much independent models agree on a prediction, and whether that agreement is trustworthy enough to act on. Instead of picking the "best" model, ICM quantifies multi-model consensus through distributional agreement (A), directional consistency (D), uncertainty overlap (U), perturbation invariance (C), and a dependency penalty (Pi), all fused via a logistic sigmoid into a single [0, 1] score. A companion Conformal Risk Control (CRC) gating layer maps ICM scores to three-way decisions: ACT, DEFER, or AUDIT -- with finite-sample coverage guarantees.
```
ICM = sigma(scale * (w_A * A + w_D * D + w_U * U + w_C * C - lambda * Pi - shift))
```
| Component | What it measures | Default weight |
|---|---|---|
| A (Agreement) | Distributional similarity across models (Hellinger / Wasserstein / MMD) | 0.35 |
| D (Direction) | Sign / argmax consistency of predictions | 0.15 |
| U (Uncertainty) | Overlap of prediction intervals or top-K probabilities | 0.25 |
| C (Invariance) | Stability under input perturbation | 0.10 |
| Pi (Dependency) | Penalty for correlated residuals / shared features / gradient similarity | 0.15 |
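A minimal sketch of the fusion formula in plain Python, using the default weights from the table above. The `scale` and `shift` values here are illustrative assumptions, not the library's defaults:

```python
import math

# Default component weights from the table above
W = {"A": 0.35, "D": 0.15, "U": 0.25, "C": 0.10}
LAM = 0.15  # lambda: weight of the dependency penalty Pi

def icm_score(A, D, U, C, Pi, scale=4.0, shift=0.5):
    """Fuse the five components into a single [0, 1] score.

    scale/shift defaults here are illustrative placeholders,
    not the values shipped with the package.
    """
    z = W["A"] * A + W["D"] * D + W["U"] * U + W["C"] * C - LAM * Pi - shift
    return 1.0 / (1.0 + math.exp(-scale * z))

# Strong consensus with low dependency pushes the score well above 0.5;
# a high dependency penalty pulls it back toward 0.
print(round(icm_score(A=0.9, D=1.0, U=0.8, C=0.9, Pi=0.1), 3))
print(round(icm_score(A=0.9, D=1.0, U=0.8, C=0.9, Pi=1.0), 3))
```

Note that Pi enters with a negative sign: two models that agree only because they share features or training data should not inflate the consensus score.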
```bash
pip install os-multi-science
```

```python
import numpy as np

from framework.icm import compute_icm_from_predictions
from framework.config import ICMConfig

# Predictions from 3 independent models (probability distributions over 3 classes)
predictions = {
    "model_A": np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
    "model_B": np.array([[0.65, 0.25, 0.1], [0.55, 0.35, 0.1]]),
    "model_C": np.array([[0.72, 0.18, 0.1], [0.58, 0.32, 0.1]]),
}

config = ICMConfig.wide_range_preset()
result = compute_icm_from_predictions(predictions, config=config)
print(f"ICM score: {result.icm:.3f}")  # High agreement -> score near 1.0
```

Evaluated on 22 UCI / OpenML datasets with 5-fold cross-validation, against 8 methods (including Deep Ensemble, Stacking, and Bagging):
| Metric | ICM-Weighted | ICM-Optimized | Deep Ensemble |
|---|---|---|---|
| Mean accuracy | 0.891 | 0.898 | -- |
| Friedman rank | 4.55 | 3.62 (2nd) | 3.45 (1st) |
| UQ set size | 1.26 | -- | -- |
| vs. RAPS set size | 55% smaller (1.26 vs 2.87) | -- | -- |
| C-component AUROC | 1.000 | -- | -- |
| Transfer attack AUROC | 1.000 | -- | -- |
Friedman test: chi2 = 29.191, p = 0.000134 (significant at alpha = 0.01). Critical difference = 2.348 (Nemenyi post-hoc). ICM-Optimized is not significantly different from Deep Ensemble.
ICM directly supports two key articles of the EU AI Act:
- Art. 14 (Human Oversight): CRC gating provides a principled ACT / DEFER / AUDIT mechanism. High-risk predictions (low ICM) are automatically routed to human review with finite-sample coverage guarantees.
- Art. 9 (Risk Management): The five-component decomposition provides an auditable breakdown of why a prediction is (or is not) trustworthy, enabling transparent risk documentation.
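The three-way gating can be sketched as a simple threshold rule over the ICM score. The thresholds below are hypothetical placeholders; in the actual CRC procedure they are calibrated on held-out data to obtain the finite-sample coverage guarantees:

```python
def crc_gate(icm, t_act=0.8, t_audit=0.4):
    """Map an ICM score in [0, 1] to a three-way decision.

    t_act and t_audit are illustrative; CRC calibrates these
    thresholds to meet a user-specified risk level.
    """
    if icm >= t_act:
        return "ACT"    # consensus is trustworthy: automate the decision
    if icm >= t_audit:
        return "DEFER"  # borderline convergence: route to human review
    return "AUDIT"      # low convergence: flag for full inspection

for score in (0.93, 0.61, 0.12):
    print(f"ICM = {score:.2f} -> {crc_gate(score)}")
```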
ICM generalizes beyond classical ML to evaluate convergence in multi-agent LLM systems -- treating each agent's output as one "epistemic method." This enables:
- Measuring agreement across multiple LLM agents on the same query
- Detecting hallucination divergence (low A, low D)
- Routing uncertain queries to human review via CRC gating
See examples/llm_convergence.py for a demonstration.
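To illustrate the idea without the full pipeline, here is a self-contained sketch of the A component's distance measure (Hellinger, per the component table) applied to three hypothetical agents' answer distributions over a 4-option query. The agent names and distributions are invented for illustration; a divergent agent shows up as a large pairwise distance:

```python
import numpy as np

# Hypothetical: three LLM agents answer the same multiple-choice query;
# each output is converted to a probability distribution over 4 options.
agents = {
    "agent_1": np.array([0.80, 0.10, 0.05, 0.05]),
    "agent_2": np.array([0.75, 0.15, 0.05, 0.05]),
    "agent_3": np.array([0.10, 0.05, 0.80, 0.05]),  # divergent outlier
}

def hellinger(p, q):
    """Hellinger distance between discrete distributions (0 = identical, 1 = disjoint)."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

names = list(agents)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = hellinger(agents[names[i]], agents[names[j]])
        print(f"{names[i]} vs {names[j]}: H = {d:.3f}")
```

Agents 1 and 2 are close (small H), while agent 3 diverges sharply from both, the low-A pattern the framework flags as possible hallucination divergence.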
If you use ICM in your research, please cite:
```bibtex
@article{stanisljevic2026icm,
  title={Index of Convergence Multi-epistemic: A Five-Component Framework for
         Trustworthy Multi-Model Decision-Making},
  author={Stanisljevic, Luka},
  journal={arXiv preprint},
  year={2026}
}
```

This project is licensed under the MIT License. See pyproject.toml for details.