🎯 Motivation
`flexeval` currently writes all model responses and their aggregated evaluation metrics to a single `outputs.jsonl` file only after the entire evaluation run completes. If the process crashes or is manually interrupted (e.g. network hiccups, provider rate-limit errors, OOM), all partially-computed data are lost and the experiment must be restarted from scratch. This is costly when:
- Hundreds/thousands of prompts are being scored
- We rely on paid API calls
- Evaluations run for hours/days
Reliable incremental persistence + resume would dramatically improve UX and resource efficiency.
🛠️ Desired behavior
- Incremental checkpointing
  - After each batch of prompts, append the individual results to `outputs.jsonl`.
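A minimal sketch of what per-batch checkpointing could look like. The function name and record shape here are illustrative assumptions, not existing flexeval API; only the `outputs.jsonl` filename comes from this issue:

```python
import json
import os

def append_batch(path, batch_results):
    """Append one batch of per-prompt results to outputs.jsonl as soon as
    the batch finishes, so a crash loses at most the in-flight batch.
    (Hypothetical helper; record fields are assumptions.)"""
    with open(path, "a", encoding="utf-8") as f:
        for record in batch_results:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())  # ensure the appended lines reach disk before continuing
```

Appending line-by-line (rather than rewriting the whole file) keeps every already-written line valid JSONL even if the process dies mid-run; at worst the final line is torn and can be discarded on resume.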
- Automatic resumable runs
  - If `outputs.jsonl` exists and `metrics.json` is absent, the previous evaluation is deemed incomplete.
  - Parse `outputs.jsonl` to collect the IDs (or line positions) of prompts that are already finished.
  - Skip those prompts and continue the evaluation with the remaining inputs, preserving their original order.
  - Append new results to the same `outputs.jsonl`, using atomic writes to prevent corruption.
  - If both files are present, assume the run finished successfully and start a fresh evaluation unless the user explicitly opts to overwrite (e.g. with a `--force` flag).
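The resume decision above could be sketched as follows. The `outputs.jsonl`/`metrics.json` filenames and the `--force` semantics come from this issue; the `plan_run` name, the `"id"` field, and the prompt dicts are assumptions for illustration:

```python
import json
import os

def plan_run(out_dir, prompts, force=False):
    """Return the subset of prompts that still needs evaluating.

    Sketch of the proposed resume logic; not existing flexeval behavior.
    """
    outputs = os.path.join(out_dir, "outputs.jsonl")
    metrics = os.path.join(out_dir, "metrics.json")

    if force or not os.path.exists(outputs):
        return prompts  # nothing to resume, or user forced a fresh run
    if os.path.exists(metrics):
        return prompts  # both files present: previous run finished; start fresh

    # outputs.jsonl exists but metrics.json is absent -> incomplete run.
    done = set()
    with open(outputs, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                break  # torn final line from a crash; redo from this point

    # Skip finished prompts while preserving the original order of the rest.
    return [p for p in prompts if p["id"] not in done]
```

Collecting IDs (rather than counting lines) keeps the resume correct even if batches complete out of order; the tolerant parse of the last line handles a write interrupted mid-record.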