Align BrowseComp and Odysseys evaluators by Julia-Lex · Pull Request #67 · lexmount/browseruse-agent-bench

Julia-Lex · 2026-06-25T03:23:59Z

Summary

forward private eval CLI options into evaluator extras
align BrowseComp grading prompt/output parsing with official grader fields
switch Odysseys rubric judging to per-rubric trajectory evaluation with action history and screenshots
accumulate Odysseys eval usage across per-rubric judge calls
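
The usage accumulation in the last item can be pictured with a minimal sketch. The `Usage` dataclass and `accumulate_usage` helper are hypothetical names chosen for illustration, not the repository's actual types; the sketch only assumes that each per-rubric judge call reports token counts, or nothing if the call produced no usage data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    """Hypothetical token-usage record for one judge call."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def __add__(self, other: "Usage") -> "Usage":
        return Usage(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
        )


def accumulate_usage(usages: Iterable[Optional[Usage]]) -> Usage:
    """Sum usage over all per-rubric calls, skipping calls with no usage."""
    total = Usage()
    for usage in usages:
        if usage is not None:
            total = total + usage
    return total
```

Summing into a fresh `Usage` rather than overwriting a single `last_usage` field is what keeps a multi-call evaluation from reporting only its final call.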

Test

uv run pytest tests/browseruse_bench/test_odysseys.py tests/browseruse_bench/test_browsecomp.py tests/browseruse_bench/test_eval_cli.py tests/browseruse_bench/test_stats.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2cfeda9a82


chatgpt-codex-connector · 2026-06-25T03:30:45Z

+    return (
+        f"User Task (context only): {task}\n\n"
+        "Evaluate ONLY this rubric item:\n"
+        + "\n".join(rubric_lines)
+        + f"\n\nFull Action History:\n{action_history}\n\n"
+        f"Screenshots attached below: {screenshot_count} "


Include the agent answer in Odysseys rubric prompts

This new per-rubric prompt no longer includes the agent's final answer even though evaluate_one still passes it into grade_rubrics. Many Odysseys rubrics require verifying what the final response states or returns, so the grader can mark those rubrics from screenshots/actions alone and miss cases where the browser reached the right page but the submitted answer omitted or misstated the required fields.
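
One way to address this suggestion, sketched under assumptions: the function name `build_rubric_prompt`, the `agent_answer` parameter, the "Agent Final Answer" label, and the placeholder text for a missing answer are all hypothetical. The quoted hunk is truncated after the screenshot count, so the end of the prompt here is also an assumption rather than the PR's actual text.

```python
from typing import List, Optional


def build_rubric_prompt(
    task: str,
    rubric_lines: List[str],
    action_history: str,
    screenshot_count: int,
    agent_answer: Optional[str],
) -> str:
    """Build a per-rubric judge prompt that also carries the final answer."""
    # Make a missing answer explicit so the judge does not assume one exists.
    answer_text = agent_answer.strip() if agent_answer else "(no final answer submitted)"
    return (
        f"User Task (context only): {task}\n\n"
        "Evaluate ONLY this rubric item:\n"
        + "\n".join(rubric_lines)
        + f"\n\nAgent Final Answer:\n{answer_text}"
        + f"\n\nFull Action History:\n{action_history}\n\n"
        f"Screenshots attached below: {screenshot_count}"
    )
```

Placing the answer before the action history keeps it close to the rubric text, so a rubric about what the response states can be checked against the submitted answer rather than inferred from screenshots.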


chatgpt-codex-connector · 2026-06-25T03:30:45Z

                else:
                    self.last_usage = None
-                    return "correct: no"
+                    raise


Avoid dropping BrowseComp tasks after grader API errors

When a BrowseComp grading request exhausts retries, this re-raises an OpenAI error; BaseEvaluator._run_iteration catches openai.OpenAIError and continues without appending an EvalResult. In runs with a transient, credential, or rate-limit failure for a task, that task disappears from the results and summary denominator instead of being recorded as failed, which can make BrowseComp accuracy look better than the attempted evaluation.
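
The denominator problem described above can be illustrated with a small sketch. Everything here is hypothetical: `GraderError` stands in for `openai.OpenAIError` so the example runs without the SDK, and `EvalResult`, `run_evaluation`, and `accuracy` are simplified stand-ins for the repository's own classes, whose real fields are not shown in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


class GraderError(Exception):
    """Stand-in for openai.OpenAIError (assumption, keeps the sketch SDK-free)."""


@dataclass
class EvalResult:
    task_id: str
    correct: bool
    error: Optional[str] = None


def run_evaluation(
    task_ids: Iterable[str], grade: Callable[[str], bool]
) -> List[EvalResult]:
    """Grade every task, recording grader failures instead of skipping them."""
    results: List[EvalResult] = []
    for task_id in task_ids:
        try:
            results.append(EvalResult(task_id, grade(task_id)))
        except GraderError as exc:
            # Keep the task in the denominator as a failed, flagged result.
            results.append(EvalResult(task_id, False, error=str(exc)))
    return results


def accuracy(results: List[EvalResult]) -> float:
    """Fraction of attempted tasks graded correct."""
    return sum(r.correct for r in results) / len(results) if results else 0.0
```

Keeping an `error` field separate from `correct` lets a summary distinguish "graded wrong" from "could not be graded" while still counting both as attempted.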


Align BrowseComp and Odysseys evaluators

2cfeda9

Julia-Lex merged commit f854654 into main Jun 25, 2026
1 of 2 checks passed

Julia-Lex deleted the codex/evaluator-official-judges branch June 25, 2026 03:24

chatgpt-codex-connector Bot reviewed Jun 25, 2026

Julia-Lex merged 1 commit into main from codex/evaluator-official-judges
