Benchmarking open-weight LLM coding agents as SCOUT delegates: model comparison experiments with pre-registered protocols, blind scoring, and full data.
benchmarking scout goose ai-agents model-comparison supplementary-materials llm-evaluation open-weight-models
Updated Mar 28, 2026 - Python