
G-Eval Standalone

Standalone native G-Eval benchmark package for Eval Protocol.

This repo includes:

  • native G-Eval reward implementation (geval/reward.py)
  • SummEval-aligned benchmark entry (benchmarks/test_geval.py)
  • SummEval sample dataset (data/summeval_sample_100.jsonl)
  • dataset preparation script (geval/prepare_data.py)
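The reward follows the standard G-Eval recipe: the judge model is given evaluation criteria for one SummEval dimension and asked for a 1–5 score. The actual prompts live in geval/reward.py; the sketch below is an illustrative assumption of that shape, not the package's code:

```python
# Hypothetical sketch of a G-Eval-style judge prompt builder.
# The real prompts and criteria are defined in geval/reward.py.
CRITERIA = {
    "coherence": "The summary should be well-structured and well-organized.",
    "consistency": "The summary should be factually aligned with the source document.",
    "relevance": "The summary should include only the important content of the source.",
    "fluency": "The summary should be grammatical, natural, and readable.",
}

def build_judge_prompt(document: str, summary: str, dimension: str = "consistency") -> str:
    """Assemble a 1-5 rating prompt for the chosen SummEval dimension."""
    return (
        f"Evaluation criteria ({dimension}): {CRITERIA[dimension]}\n\n"
        f"Source document:\n{document}\n\n"
        f"Summary:\n{summary}\n\n"
        "Rate the summary on the criteria above from 1 (worst) to 5 (best). "
        "Respond with a single digit."
    )
```

Asking for a single digit is what makes the logprob-weighted scoring in the Notes section possible: the judge's answer distribution sits on five known tokens.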

Quickstart

uv sync --extra dev

Set one API key:

export OPENAI_API_KEY=...
# or Fireworks (OpenAI-compatible)
export FIREWORKS_API_KEY=...

Run benchmark

uv run ep local-test --entry benchmarks/test_geval.py::test_geval_benchmark --ignore-docker

or

uv run pytest benchmarks/test_geval.py::test_geval_benchmark -q -s

Key env vars

  • GEVAL_DATA_PATH: path to JSONL dataset (default: data/summeval_sample_100.jsonl)
  • GEVAL_DIMENSION: coherence|consistency|relevance|fluency (default: consistency)
  • GEVAL_JUDGE_MODEL: judge model id (default: gpt-4.1-mini)
  • GEVAL_TOP_LOGPROBS: top logprobs count (default: 5)
  • GEVAL_SAMPLING_FALLBACK_N: sampling fallback count when logprobs unsupported (default: 20)
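For reference, the variables above resolve with the documented defaults. A minimal sketch of reading them (the actual parsing lives inside the package):

```python
import os

# Resolve G-Eval settings from the environment, falling back to the
# defaults documented in this README.
config = {
    "data_path": os.getenv("GEVAL_DATA_PATH", "data/summeval_sample_100.jsonl"),
    "dimension": os.getenv("GEVAL_DIMENSION", "consistency"),
    "judge_model": os.getenv("GEVAL_JUDGE_MODEL", "gpt-4.1-mini"),
    "top_logprobs": int(os.getenv("GEVAL_TOP_LOGPROBS", "5")),
    "sampling_fallback_n": int(os.getenv("GEVAL_SAMPLING_FALLBACK_N", "20")),
}
```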

Refresh dataset

uv run geval-build-data --limit 100 --output data/summeval_sample_100.jsonl

Notes

  • The reward computes a discrete score and, when possible, a probability-weighted expected score.
  • Probability weighting prefers token logprobs and falls back to sampling-based estimation.
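Concretely, the expected score treats the judge's answer-token distribution as probabilities over the 1–5 scale: E[s] = Σ p(s)·s. A minimal sketch of both paths, assuming top logprobs over the digit tokens (helper names are illustrative, not the package's API):

```python
import math

def expected_score_from_logprobs(top_logprobs: dict) -> float:
    """Probability-weighted score from the judge's top token logprobs.

    top_logprobs maps candidate answer tokens to logprobs. Mass on
    non-score tokens is dropped and the remainder renormalized.
    """
    weights = {
        int(tok): math.exp(lp)
        for tok, lp in top_logprobs.items()
        if tok in {"1", "2", "3", "4", "5"}
    }
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total

def expected_score_from_samples(samples: list) -> float:
    """Fallback when the API exposes no logprobs: mean of n sampled scores
    (n is GEVAL_SAMPLING_FALLBACK_N)."""
    return sum(samples) / len(samples)
```

The logprob path needs only one judge call, which is why it is preferred; the sampling fallback trades extra calls for an estimate of the same expectation.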