This repository contains a benchmark to evaluate large language models (LLMs) on several finance-related question. Most evaluation use an image as the context (for example a chart or a table image). The only exception is the NER evaluation, which uses a short paragraph as context.
The benchmark runs model generation, automated evaluations against several judge models, per-task aggregation, and final aggregation across models.
The repository contains the following evaluation tasks. Each task generally provides the model with a context (image or paragraph) and a target set of questions or prompts.
-
NER- Input: a paragraph (text)
- Goal: named-entity recognition and extraction in a finance setting.
-
charts- Input: chart images (figures)
- Goal: read and interpret charts, answer questions about trends, values, and relationships shown on the chart.
-
tables- Input: table images
- Goal: extract and reason over structured table data presented as images.
-
tables_yn_tf- Input: table images
- Goal: a reformulation of the table evaluation focusing on yes/no or true/false style queries and paraphrase robustness.
-
special_cases- Input: mostly image context; contains edge cases and intentionally tricky examples to probe model robustness.
-
calculs_conversation- Input: image context (usually) and a multi-turn conversation where the model must perform calculations and carry context across turns.
-
calculs_conversation_gold- Input: like
calculs_conversationbut at each question the benchmark provides the LLM with the gold (correct) response for the previous turn. - Goal: measure performance when the model is fed the correct prior replies (useful to isolate downstream reasoning from accumulation of previous generation errors).
- Input: like
Datasets for these tasks live in dataset_json/ (e.g.
Q&A_finance.dataset_charts.json, Q&A_finance.dataset_NER.json, etc.).
generation.py— runs the tested model to generate answers for a given task.evaluate.py— compares a tested model's outputs with judge models to compute evaluation metrics.group_evaluation.py— aggregates evaluation metrics per model/task.aggregate_results.py— final aggregation across tested models, producing repo-level summary outputs and tables.run_model.sh— SLURM job script that runs generation + evaluation + group aggregation for a single tested model across all tasks (used as a job-array; see below).run.sh— convenience script that submitsrun_model.shfor multiple tested models and schedules the finalaggregate_results.pyjob after model jobs finish.
Results, logs and intermediate outputs are written to these folders:
logs/— SLURM job logs and evaluation logs (seerun_model.sh).results/— per-run outputs produced by evaluation/generation steps.aggregated_results/— aggregated metrics and result tables.
- Python 3.12+ (the
pyproject.tomlrequires >=3.12). - A CUDA-enabled GPU environment if you plan to run large vision-enabled models or use vLLM for efficient inference.
Install dependencies (recommended using the repository's uv runner):
# install the uv runner if you don't have it
pip install uv
# install project dependencies via uv (reads pyproject)
uv syncThis benchmark use meta-llama/Llama-3.3-70B-Instruct Qwen/Qwen3-32B google/gemma-3-27b-it models to judge the tested models. Make sure you have access to these models on HuggingFace and that you have downloaded them before running the benchmark.
-
run_model.sh <MODEL>is a SLURM script designed to be submitted withsbatch. It uses a SLURM array to iterate over the task list:TASKS=(NER charts calculs_conversation calculs_conversation_gold special_cases tables tables_yn_tf)
The SLURM array index selects which TASK to run for that job-array task.
Inside each array task
run_model.sh:- runs
generation.pyfor the tested model and the selected task - runs
evaluate.pyrepeatedly with several judge models to compare the tested model's outputs - runs
group_evaluation.pyto aggregate per-task metrics for the model
- runs
-
run.shcontains an array of tested models and submitsrun_model.shviasbatchfor each model. After all model jobs are submitted,run.shsubmits a dependent job that runsuv run aggregate_results.pyand waits for all model jobs to complete before starting aggregation.
Notes:
- The scripts use the
uvcommand to run the repository scripts (e.g.uv run generation.py), which is the project's task runner. If you don't haveuv, you can run the Python modules directly (see examples below). - The SLURM scripts assume an HPC environment. To run locally without SLURM
you can directly call the scripts (or use
uvif available).
- Run the full multi-model benchmark (on a SLURM cluster):
./run.shThis will submit run_model.sh (via sbatch) for each model listed in
run.sh and then submit a final aggregation job depending on those jobs.
- Submit a single model job (SLURM):
sbatch run_model.sh <huggingface-model-id-or-local-model-ref>Example:
sbatch run_model.sh Qwen/Qwen3-VL-8B-Instruct- Run a single task locally (no SLURM) using
uv(if available):
uv run generation.py <MODEL> <TASK>
# then evaluate with a judge model
uv run evaluate.py <MODEL> <JUDGE_MODEL> <TASK>
uv run group_evaluation.py <MODEL> <TASK>Or run the Python modules directly (replace uv run with python):
python generation.py <MODEL> <TASK>
python evaluate.py <MODEL> <JUDGE_MODEL> <TASK>
python group_evaluation.py <MODEL> <TASK>- Logs and outputs
- SLURM stdout/stderr are stored under
logs/(filenames come from the SBATCH directives inrun_model.shandrun.sh). - Per-run results and intermediate artifacts are written to
results/andaggregated_results/depending on the script.
- Edit the
modelsarray insiderun.shto add or remove tested models. - Or submit
run_model.shdirectly with the model identifier you want to test (see examples above).
- Add the task name to the
TASKSarray insiderun_model.shand implement the data file(s) indataset_json/and handling logic ingeneration.py/evaluate.py.
- The benchmark primarily assumes access to an HPC/SLURM environment for large-scale model runs. If you want to run locally, reduce the task footprint and run tasks sequentially.
run_model.shis configured as a SLURM job-array. The array indices map to theTASKSlist in the file. Make sure to update both if you add/remove tasks.- If
uvis not available on your system, run the Python scripts directly with your Python environment.
- Check
logs/for SLURM stdout/stderr when jobs fail. - If aggregation doesn't run, ensure the initial
sbatchsubmissions inrun.shsucceeded and that the job IDs were captured correctly.
This work will be published as part of the 7th Financial Narrative Processing Workshop, colocated with LREC2026. If you use this model, please cite:
title={When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents},
author={Virginie Mouilleron and Théo Lasnier and Anna Mosolova and Djamé Seddah},
year={2026},
eprint={2602.10384},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.10384}
}