feat(core): wire base_commit into docker-workspace for SWE-bench evals #987

@christso

Description

Objective

The docker-workspace.ts module needs to read base_commit from the eval YAML workspace config and run git reset --hard {base_commit} inside the container before the agent starts work.

Context

PR #986 (agentv import huggingface) now places base_commit in workspace.docker:

workspace:
  docker:
    image: swebench/sweb.eval.django__django:latest
    timeout: 600
    memory: 4g
    base_commit: 4fd3044ca0135da903a70dfb66992293f529ecf1

However, docker-workspace.ts (from PR #971) does not read this field. As a result, the container's repo state is whatever the Docker image was built with, which may not match the commit the SWE-bench instance requires.
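A minimal sketch of how the field could be read once it is part of the config. The `DockerWorkspaceConfig` interface, the `readBaseCommit` helper, and the hex-SHA check are all assumptions for illustration; the actual types in docker-workspace.ts and eval-file.schema.ts may differ.

```typescript
// Hypothetical shape of the workspace.docker block. Field types are inferred
// from the YAML example above, not taken from the actual AgentV schema.
interface DockerWorkspaceConfig {
  image: string;
  timeout?: number;
  memory?: string;
  base_commit?: string;
}

// Assumption: base_commit is a hex git object name (abbreviated or full SHA-1).
const COMMIT_SHA = /^[0-9a-f]{7,40}$/i;

// Returns the configured commit, or undefined when the field is absent.
// Throws on anything that does not look like a commit hash.
function readBaseCommit(config: DockerWorkspaceConfig): string | undefined {
  const commit = config.base_commit;
  if (commit === undefined) return undefined;
  if (!COMMIT_SHA.test(commit)) {
    throw new Error(`Invalid base_commit: ${commit}`);
  }
  return commit;
}
```

Restricting the value to hex characters also keeps it safe to pass on to a git command later, though whether the real schema should be that strict is a design choice for the implementer.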

SWE-bench behavior

SWE-bench's harness does:

  1. Image build time: git reset --hard {base_commit} (baked into the image)
  2. Eval time: Reset test files to base_commit before running tests: git checkout {base_commit} {modified_files}

For pre-built SWE-bench Docker images, the commit is already baked in. But for custom or rebuilt images, the workspace must ensure the correct checkout.
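The two harness steps above can be expressed as argument vectors. This is only an illustration of the commands as described in this issue, not SWE-bench's own code; the function names are hypothetical, and the `--` separator is an addition that is not in the command quoted above.

```typescript
// Image build time: move the repo to the instance's base commit.
function resetToCommitArgs(baseCommit: string): string[] {
  return ["git", "reset", "--hard", baseCommit];
}

// Eval time: restore only the given files to their state at base_commit.
function restoreFilesArgs(baseCommit: string, modifiedFiles: string[]): string[] {
  // "--" separates the commit from the paths so a file name is never
  // parsed as a revision.
  return ["git", "checkout", baseCommit, "--", ...modifiedFiles];
}
```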

Implementation

  1. Add base_commit to the Docker workspace schema in packages/core/src/evaluation/validation/eval-file.schema.ts
  2. In docker-workspace.ts, after container start, if base_commit is set:
    • Run git reset --hard {base_commit} in the container
    • Verify the checkout succeeded
  3. Add tests
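Step 2 could look roughly like the sketch below. `ContainerExec` is a hypothetical runner that executes a command inside the started container (for example by wrapping `docker exec`); the real docker-workspace.ts interface is not shown in this issue. Checking `git rev-parse HEAD` is one possible way to verify the checkout, assumed here rather than prescribed.

```typescript
// Hypothetical runner: executes argv inside the container and resolves
// with the exit code and captured output.
type ExecResult = { exitCode: number; stdout: string; stderr: string };
type ContainerExec = (argv: string[]) => Promise<ExecResult>;

async function resetToBaseCommit(
  exec: ContainerExec,
  baseCommit?: string,
): Promise<void> {
  // The field is optional: without it, keep the image's repo state as-is.
  if (!baseCommit) return;

  const reset = await exec(["git", "reset", "--hard", baseCommit]);
  if (reset.exitCode !== 0) {
    throw new Error(
      `git reset --hard ${baseCommit} failed: ${reset.stderr.trim()}`,
    );
  }

  // Verify: HEAD should now resolve to the requested commit. The prefix
  // match allows an abbreviated hash in the config.
  const head = await exec(["git", "rev-parse", "HEAD"]);
  const actual = head.stdout.trim();
  if (head.exitCode !== 0 || !actual.startsWith(baseCommit.toLowerCase())) {
    throw new Error(`HEAD is ${actual || "unknown"}, expected ${baseCommit}`);
  }
}
```

Injecting the runner keeps the function testable without Docker: a test can pass a fake `exec` that records the commands and returns canned output.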

Acceptance criteria

  • workspace.docker.base_commit is recognized in eval YAML
  • Container is checked out to base_commit before agent work begins
  • Works with SWE-bench Docker images
  • Passes agentv validate with the new field

Metadata

Assignees

No one assigned

    Labels

    core: Anything pertaining to core functionality of AgentV
    enhancement: New feature or request
    in-progress: Claimed by an agent, do not duplicate work

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
