Releases: GoogleCloudPlatform/evalbench

v1.9.0

11 Jun 16:15
f399339

1.9.0 (2026-06-11)

Features

  • add AgyCliGenerator support to evaluator and models, including test suite and configuration datasets (9e6d71f)
  • add Antigravity agent tab and update CLI version retrieval logic (fe1a0ae)
  • add Antigravity agent tab and update CLI version retrieval logic (147680b)
  • dea: support YAML-only configuration and penguins dataset for DEA conversational evaluation (#410) (2e37c1c)
  • recover resolved model label from agy cli logs for statistics bucket tagging (ee7635f)
  • update MCP config translation to support both serverUrl and url fields in agy cli (685b1b2)
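
The last item above mentions accepting either `serverUrl` or `url` when translating MCP server configuration. A minimal sketch of that kind of fallback, assuming a plain dict per server entry; the helper name `translate_mcp_server`, the normalized output key `url`, and the precedence between the two fields are illustrative assumptions, not evalbench's actual code:

```python
from typing import Any


def translate_mcp_server(entry: dict[str, Any]) -> dict[str, Any]:
    """Normalize one MCP server entry so the endpoint lives under one key.

    Hypothetical helper: accepts "serverUrl" or "url" and emits "url".
    """
    # Prefer "serverUrl" when both are present (an arbitrary choice for this sketch).
    endpoint = entry.get("serverUrl") or entry.get("url")
    if not endpoint:
        raise ValueError("MCP server entry needs a 'serverUrl' or 'url' field")
    # Copy every other field through unchanged.
    translated = {k: v for k, v in entry.items() if k not in ("serverUrl", "url")}
    translated["url"] = endpoint
    return translated
```

Reading both keys in one place keeps the rest of the translation code independent of which spelling a given config file uses.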

Bug Fixes

  • dea: resolve concurrency deadlock in GcpAdcCredentialService (#422) (6619ae5)
  • pin pyopenssl<26.2 to avoid google-auth mTLS regression (#424) (c8916ae)
  • update error message to list supported AgentCliGenerator subclasses (dfe1e8e)
  • update generator key path in trends data mapping to model_config.generator (3739f9d)

v1.8.0

04 Jun 00:33
c4f9c2a

1.8.0 (2026-06-03)

Features

  • add Codex to agent filters in viewer (a2be23c)
  • add Codex to agent filters in viewer (#387) (2ac6c1f)
  • add filter_native_tools option to trajectory_matcher to optionally ignore native harness tools during scoring (2c34d94)
  • add packaged console script entrypoint to support uvx execution (#385) (8ea07f8)
  • dea: define EvalDeaRequest input model for conversational evaluations (#407) (04b91cf)
  • opt-in function-calling for the Gemini SDK judge (#409) (d97f511)
  • Rename package to google-evalbench and decouple viewer dependencies (#390) (0d75811)
  • scorers: filter native tools in trajectory_matcher with opt-out flag (2c1ab58)
  • stabilize Cloud Run deployment and polish standalone CLI UX (#389) (4720eef)
  • support work_dir for claude code eval (#403) (179e0d3)

Bug Fixes

  • add --no-sync flag to runtime uv run commands to prevent PyPI timeouts (#392) (0c4783c)
  • allow-list files in fake home directory for Gemini CLI (#395) (734bc2a)
  • fix Mesop event routing bug in trends dropdown (023d150)
  • gemini-cli: support 'name' parameter key in skill extraction (#378) (62400da)
  • patch absl help output when running via uvx/launcher (85048e2)
  • prevent silent errors on DB query timeouts and extend deadline (#406) (fbbd31d)
  • surface eval failures instead of silently terminating or crashing (#398) (9c36108)

v1.7.1

07 May 19:38
f36342f

1.7.1 (2026-05-07)

Bug Fixes

  • trigger release-please for username/password issue (#371) (506a717)

v1.7.0

05 May 18:14
24b022b

1.7.0 (2026-05-05)

Features

  • add Dataform scorers and plumb isolated fake_home workspace directory tracking (#349) (b4ddfda)
  • Add dbt Scorers for Agent Evaluations (#367) (5496d59)
  • Add GCS Artifacts Reporter for Agent Evaluations (#366) (11def06)
  • implement lifecycle execution for setup and teardown scripts (#360) (9de38da)

Bug Fixes

  • correct indentation in eval_service.py to resolve SyntaxError (8512441)
  • correct return signature in base Orchestrator.process() (ad8228e)
  • correct return signature in remaining Orchestrators (b78a9e9)
  • pass None for missing metrics parameter in AgentOrchestrator results initialization (c72c758)

v1.6.0

30 Apr 17:45
9ebd45e

1.6.0 (2026-04-30)

Features

  • add exponential backoff for Gemini 429 resource exhausted errors (#352) (11d4236)
  • allow execution of mixed DDL and DML statements in Spanner driver (#351) (64c0b63)
  • chunk Spanner DDL statements into groups of 10 to avoid limits (#353) (306caad)
  • handle quoted fields in CSV setup data (#350) (3ba2334)
  • handle recursive dependency cleanup in Spanner table drop (#356) (3eb4d45)
  • use parameterized queries for data insertion in MySQL, Postgres, and SQLite (#358) (733724d)
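
The Spanner item above describes splitting DDL statements into groups of 10. A minimal sketch of that batching step, assuming the statements are already available as a list of strings; the function name `chunk_statements` is hypothetical, and the real driver may batch differently:

```python
from typing import Iterator, Sequence


def chunk_statements(statements: Sequence[str], size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` statements, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(statements), size):
        # The final group may be shorter than `size`.
        yield list(statements[start:start + size])
```

Each yielded group could then be submitted as its own schema update, so no single request exceeds the per-request statement limit.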

Bug Fixes

  • ensure setup and cleanup SQL run for DDL and DML queries (#354) (89b7c4d)
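
The 1.6.0 feature list includes exponential backoff for Gemini 429 (resource exhausted) errors. A minimal sketch of the general technique, assuming a zero-argument callable to retry; `ResourceExhaustedError` is a stand-in exception defined here, and the attempt count, delays, and jitter are illustrative values rather than evalbench's settings:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ResourceExhaustedError(Exception):
    """Stand-in for a 429 error; hypothetical, not a real SDK class."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on ResourceExhaustedError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ResourceExhaustedError:
            if attempt == max_attempts - 1:
                raise  # Out of retries: surface the original error.
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without actually waiting.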

v1.5.0

27 Apr 03:08
b6dd641

1.5.0 (2026-04-26)

Features

  • add support for custom host configuration and insecure gRPC channels via environment variables (#336) (05efee0)

Bug Fixes

  • geminicli: resolve extension names from local manifests to apply settings (#338) (4d82afe)

v1.4.0

15 Apr 01:11
fde9626

1.4.0 (2026-04-15)

Features

  • scorer/llmrater: add fallback to SQL logic comparison for empty results (#326) (d168ac0)

v1.3.1

10 Apr 22:23
80a929e

1.3.1 (2026-04-10)

Bug Fixes

  • databases/alloydb: restore correct use_adc flag behavior (#315) (909e11d)
  • generators/query_data_api: add retry support for transient API errors (#317) (e5fdead)

v1.3.0

09 Apr 19:41
da17a11

1.3.0 (2026-04-09)

Features

  • Add summary_in_response and improve LLM rater resilience (#311) (68b72ee)

v1.2.0

09 Apr 18:13
7fc27a5

1.2.0 (2026-04-07)

Features

  • adc: support ADC for database authentication (#306) (6cb05e6)
  • add Cloud Run support with entrypoint script, custom CSS, and environment-based XSRF configuration (82fdeca)
  • add UV_NO_SYNC support to run script and update Dockerfile and cloudbuild configuration accordingly (43731f9)
  • allow database name mapping via config (#303) (3e8d25a)
  • geminicli: populate adc in fake home (01c9c5b)
  • geminicli: populate adc in fake home (ce06c9b)
  • implement on_load logic to auto-select job directory from query parameters (4691de4)

Bug Fixes

  • consolidate experiment_config flag into util/flags.py (#304) (432d11e)
  • handle empty queries safely, ensure golden execution, and parse config robustly (#265) (9ba022b)
  • remove backticks from sanitized SQL strings (#297) (4e4e201)