AstroReason-Bench

AstroReason-Bench is a benchmark-core repository for evaluating LLM agents on space mission design and planning problems.

This branch is under active development as we expand the suite with more subtasks and decouple benchmarks from solution implementations. To reproduce the results from the current paper, use the v1 branch, where benchmarks and solutions are still coupled. A separate solutions repository is planned for a future release.

Why This Project?

Aerospace lacks rigorous, standardized, algorithm-agnostic benchmarks. While AI has ImageNet and GLUE, space mission design still lacks a shared evaluation suite built around well-defined problems and verifiable scoring.

This repository focuses on:

Datasets: Canonical benchmark instances
Verifiers: Standalone validation and scoring logic
Reproducibility tools: Optional benchmark-local generators and visualizers

Any approach can be evaluated: LLM agents, metaheuristics, RL systems, or human experts. This repository defines the benchmark; it does not ship solution implementations.

Repository Structure

astro-reason/
├── benchmarks/{name}/
│   ├── dataset/              # Problem instances
│   ├── verifier.py           # or verifier/run.py
│   ├── visualizer.py         # optional, or visualizer/run.py
│   ├── generator.py          # optional, or generator/run.py
│   └── README.md             # Problem specification and file formats
└── tests/
    └── benchmarks/           # Focused tests for verifiers and benchmark tooling
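
The tree above allows each tool to live either as a single file (for example verifier.py) or as a directory with a run.py. As a minimal sketch of how a script might resolve those two layouts, assuming only the paths shown in the tree, the following Python helper checks both locations. The function name find_entry_point is hypothetical and is not part of this repository.

```python
from pathlib import Path
from typing import Optional


def find_entry_point(benchmark_dir: Path, tool: str) -> Optional[Path]:
    """Return the script for `tool` ("verifier", "visualizer", or "generator").

    Hypothetical helper: checks `<tool>.py` first, then `<tool>/run.py`,
    mirroring the two layouts listed in the tree. Returns None if neither
    file exists (visualizers and generators are optional).
    """
    candidates = [benchmark_dir / f"{tool}.py", benchmark_dir / tool / "run.py"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Checking the single-file form first is an arbitrary choice for this sketch; the README does not say which layout takes precedence if both exist.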

Benchmark Design Principles

Algorithm-agnostic: Benchmarks define problems and verification, not the method used to solve them.
Standalone: Each benchmark is self-contained with no dependencies on other benchmarks.
Reproducible: Optional generators can recreate or extend datasets when appropriate.
Solution-free core: Solutions and baselines live outside this repository.

Environment

This project uses uv for environment management. To check that a benchmark's verifier still behaves as expected, run its focused tests, replacing <name> with the benchmark name:

uv run pytest tests/benchmarks/test_<name>_verifier.py

Run focused tests instead of the full suite unless you specifically need a broader check.
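
To run the focused tests for several benchmarks from a script, the command above can be assembled programmatically. This is a rough sketch, assuming the test file naming shown above and that it is run from the repository root; the function names are hypothetical and not part of this repository.

```python
import subprocess
from typing import List


def focused_test_command(name: str) -> List[str]:
    """Build the focused verifier test command for one benchmark.

    `name` fills the <name> placeholder in the test file path.
    """
    return ["uv", "run", "pytest", f"tests/benchmarks/test_{name}_verifier.py"]


def run_focused_tests(name: str) -> int:
    """Run the command in the current directory and return pytest's exit code."""
    return subprocess.run(focused_test_command(name), check=False).returncode
```

Passing the command as a list avoids shell quoting issues, and check=False leaves it to the caller to decide how to handle a non-zero exit code.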

Status

Current priorities include:

Refactoring several newer benchmarks
Reimplementing several existing verifiers
Writing dataset generators for reproducibility
