Align BrowseComp and Odysseys evaluators #67

Merged
Julia-Lex merged 1 commit into main from codex/evaluator-official-judges
Jun 25, 2026

@Julia-Lex

Contributor

Summary

  • forward private eval CLI options into evaluator extras
  • align BrowseComp grading prompt/output parsing with official grader fields
  • switch Odysseys rubric judging to per-rubric trajectory evaluation with action history and screenshots
  • accumulate Odysseys eval usage across per-rubric judge calls
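
The usage accumulation in the last bullet could look roughly like the sketch below. The `Usage` class, its field names, and `accumulate_usage` are hypothetical stand-ins, not the repository's actual types; the sketch only assumes that each per-rubric judge call reports token counts, or nothing when a call fails.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    """Hypothetical token-usage record; the field names are assumptions."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def __add__(self, other: "Usage") -> "Usage":
        return Usage(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
        )


def accumulate_usage(per_call_usage: Iterable[Optional[Usage]]) -> Usage:
    """Sum usage over per-rubric judge calls, skipping calls that reported none."""
    total = Usage()
    for usage in per_call_usage:
        if usage is not None:
            total = total + usage
    return total


# Three judge calls for one task; the second call reported no usage.
calls = [Usage(120, 15), None, Usage(80, 10)]
total = accumulate_usage(calls)
print(total)
```

Summing per task rather than keeping only the last call's usage keeps cost reporting comparable with a single-call judge.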

Test

  • uv run pytest tests/browseruse_bench/test_odysseys.py tests/browseruse_bench/test_browsecomp.py tests/browseruse_bench/test_eval_cli.py tests/browseruse_bench/test_stats.py

Julia-Lex merged commit f854654 into main on Jun 25, 2026
1 of 2 checks passed
Julia-Lex deleted the codex/evaluator-official-judges branch on June 25, 2026 03:24

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2cfeda9a82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +100 to +105
return (
f"User Task (context only): {task}\n\n"
"Evaluate ONLY this rubric item:\n"
+ "\n".join(rubric_lines)
+ f"\n\nFull Action History:\n{action_history}\n\n"
f"Screenshots attached below: {screenshot_count} "

P1: Include the agent answer in Odysseys rubric prompts

This new per-rubric prompt no longer includes the agent's final answer even though evaluate_one still passes it into grade_rubrics. Many Odysseys rubrics require verifying what the final response states or returns, so the grader can mark those rubrics from screenshots/actions alone and miss cases where the browser reached the right page but the submitted answer omitted or misstated the required fields.
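
One possible way to address this is sketched below. It mirrors the shape of the quoted prompt builder, but the function name `build_rubric_prompt`, the `agent_answer` parameter, and the "Agent Final Answer" section label are assumptions for illustration. The tail of the original prompt is truncated in the quoted diff and is not reproduced here.

```python
from typing import List, Optional


def build_rubric_prompt(
    task: str,
    rubric_lines: List[str],
    action_history: str,
    screenshot_count: int,
    agent_answer: Optional[str],
) -> str:
    """Hypothetical per-rubric prompt that also carries the agent's final answer."""
    # Make a missing answer explicit so the judge does not assume one exists.
    answer_text = agent_answer.strip() if agent_answer else "(no final answer submitted)"
    return (
        f"User Task (context only): {task}\n\n"
        "Evaluate ONLY this rubric item:\n"
        + "\n".join(rubric_lines)
        + f"\n\nAgent Final Answer:\n{answer_text}\n\n"
        f"Full Action History:\n{action_history}\n\n"
        f"Screenshots attached below: {screenshot_count}"
    )


prompt = build_rubric_prompt(
    task="Find the listed price",
    rubric_lines=["- The final response states the price"],
    action_history="1. open page\n2. scroll to pricing",
    screenshot_count=2,
    agent_answer="The price is $5.",
)
print(prompt)
```

Placing the answer before the action history is a design choice in this sketch, intended to let the judge compare the stated answer against the trajectory evidence that follows it.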

Useful? React with 👍 / 👎.

else:
self.last_usage = None
return "correct: no"
raise

P2: Avoid dropping BrowseComp tasks after grader API errors

When a BrowseComp grading request exhausts its retries, this re-raises an OpenAI error; BaseEvaluator._run_iteration catches openai.OpenAIError and continues without appending an EvalResult. When a task hits a transient, credential, or rate-limit failure, that task disappears from the results and from the summary denominator instead of being recorded as failed, which can make BrowseComp accuracy look better than it would if every attempted task were counted.
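
A minimal sketch of the alternative, recording a failed result instead of skipping the task, is shown below. `GraderAPIError` stands in for `openai.OpenAIError` so the example runs without the SDK, and the `EvalResult` fields, `run_all`, and `accuracy` are hypothetical; the real evaluator's types and loop may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class GraderAPIError(Exception):
    """Stand-in for openai.OpenAIError (assumption, keeps the sketch SDK-free)."""


@dataclass
class EvalResult:
    """Hypothetical result record; the real EvalResult fields may differ."""
    task_id: str
    correct: bool
    error: Optional[str] = None


def run_all(task_ids: List[str], grade: Callable[[str], bool]) -> List[EvalResult]:
    results = []
    for task_id in task_ids:
        try:
            results.append(EvalResult(task_id, grade(task_id)))
        except GraderAPIError as exc:
            # Keep the task in the denominator as a failure instead of dropping it.
            results.append(EvalResult(task_id, False, error=str(exc)))
    return results


def accuracy(results: List[EvalResult]) -> float:
    return sum(r.correct for r in results) / len(results) if results else 0.0


def fake_grade(task_id: str) -> bool:
    # Simulated grader: t2 hits an API failure, only t1 is graded correct.
    if task_id == "t2":
        raise GraderAPIError("rate limit")
    return task_id == "t1"


results = run_all(["t1", "t2", "t3"], fake_grade)
print(len(results), accuracy(results))
```

With the failed task kept, accuracy here is 1 of 3; dropping it would report 1 of 2, which is the inflation the comment describes. Keeping the `error` field also lets a summary distinguish grader failures from genuinely incorrect answers.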

Useful? React with 👍 / 👎.
