
G-Eval Standalone

Standalone native G-Eval benchmark package for Eval Protocol.

This repo includes:

  • native G-Eval reward implementation (geval/reward.py)
  • SummEval-aligned benchmark entry (benchmarks/test_geval.py)
  • SummEval sample dataset (data/summeval_sample_100.jsonl)
  • dataset preparation script (geval/prepare_data.py)
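The reward follows the standard G-Eval recipe: the judge model is given evaluation criteria for one SummEval dimension and asked for a 1–5 score. The actual prompts live in geval/reward.py; the sketch below is an illustrative assumption of that shape, not the package's code:

```python
# Hypothetical sketch of a G-Eval-style judge prompt builder.
# The real prompts and criteria are defined in geval/reward.py.
CRITERIA = {
    "coherence": "The summary should be well-structured and well-organized.",
    "consistency": "The summary should be factually aligned with the source document.",
    "relevance": "The summary should include only the important content of the source.",
    "fluency": "The summary should be grammatical, natural, and readable.",
}

def build_judge_prompt(document: str, summary: str, dimension: str = "consistency") -> str:
    """Assemble a 1-5 rating prompt for the chosen SummEval dimension."""
    return (
        f"Evaluation criteria ({dimension}): {CRITERIA[dimension]}\n\n"
        f"Source document:\n{document}\n\n"
        f"Summary:\n{summary}\n\n"
        "Rate the summary on the criteria above from 1 (worst) to 5 (best). "
        "Respond with a single digit."
    )
```

Asking for a single digit is what makes the logprob-weighted scoring in the Notes section possible: the judge's answer distribution sits on five known tokens.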

Quickstart

uv sync --extra dev

Set one API key:

export OPENAI_API_KEY=...
# or Fireworks (OpenAI-compatible)
export FIREWORKS_API_KEY=...

Run benchmark

uv run ep local-test --entry benchmarks/test_geval.py::test_geval_benchmark --ignore-docker

or

uv run pytest benchmarks/test_geval.py::test_geval_benchmark -q -s

Key env vars

  • GEVAL_DATA_PATH: path to JSONL dataset (default: data/summeval_sample_100.jsonl)
  • GEVAL_DIMENSION: coherence|consistency|relevance|fluency (default: consistency)
  • GEVAL_JUDGE_MODEL: judge model id (default: gpt-4.1-mini)
  • GEVAL_TOP_LOGPROBS: top logprobs count (default: 5)
  • GEVAL_SAMPLING_FALLBACK_N: sampling fallback count when logprobs unsupported (default: 20)
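For reference, the variables above resolve with the documented defaults. A minimal sketch of reading them (the actual parsing lives inside the package):

```python
import os

# Resolve G-Eval settings from the environment, falling back to the
# defaults documented in this README.
config = {
    "data_path": os.getenv("GEVAL_DATA_PATH", "data/summeval_sample_100.jsonl"),
    "dimension": os.getenv("GEVAL_DIMENSION", "consistency"),
    "judge_model": os.getenv("GEVAL_JUDGE_MODEL", "gpt-4.1-mini"),
    "top_logprobs": int(os.getenv("GEVAL_TOP_LOGPROBS", "5")),
    "sampling_fallback_n": int(os.getenv("GEVAL_SAMPLING_FALLBACK_N", "20")),
}
```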

Refresh dataset

uv run geval-build-data --limit 100 --output data/summeval_sample_100.jsonl

Notes

  • The reward computes a discrete score and, when possible, a probability-weighted expected score.
  • Probability weighting prefers token logprobs and falls back to sampling-based estimation.
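Concretely, the expected score treats the judge's answer-token distribution as probabilities over the 1–5 scale: E[s] = Σ p(s)·s. A minimal sketch of both paths, assuming top logprobs over the digit tokens (helper names are illustrative, not the package's API):

```python
import math

def expected_score_from_logprobs(top_logprobs: dict) -> float:
    """Probability-weighted score from the judge's top token logprobs.

    top_logprobs maps candidate answer tokens to logprobs. Mass on
    non-score tokens is dropped and the remainder renormalized.
    """
    weights = {
        int(tok): math.exp(lp)
        for tok, lp in top_logprobs.items()
        if tok in {"1", "2", "3", "4", "5"}
    }
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total

def expected_score_from_samples(samples: list) -> float:
    """Fallback when the API exposes no logprobs: mean of n sampled scores
    (n is GEVAL_SAMPLING_FALLBACK_N)."""
    return sum(samples) / len(samples)
```

The logprob path needs only one judge call, which is why it is preferred; the sampling fallback trades extra calls for an estimate of the same expectation.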