
Add Harbor Framework Support #8

Open
hrdkbhatnagar wants to merge 5 commits into main from add_harbor_support

Conversation

@hrdkbhatnagar
Collaborator

@hrdkbhatnagar hrdkbhatnagar commented Jan 13, 2026

Adds Harbor framework support to PostTrainBench, enabling anyone to run our benchmark on cloud GPUs (Modal, Daytona) without needing access to our internal HTCondor cluster.

At the moment:

  • Generate Harbor-compatible task directories from PostTrainBench benchmarks
  • Almost full parity with the original pipeline

Tested:

  • Generated task for gsm8k + qwen3-1.7b
  • Ran a 1-hour test with Claude Code (Sonnet 4) on Modal
  • Verified end-to-end pipeline (including eval + contam judge)
  • Confirmed accuracy metrics extracted correctly

Usage

  cd src/harbor_adapter
  uv sync

  # Generate a task
  python run_adapter.py --benchmark gsm8k --model qwen3-1.7b --output ./tasks

  # Run with Harbor
  harbor run \
      --path ./tasks/posttrainbench-gsm8k-qwen3-1.7b \
      --agent claude-code \
      --model anthropic/claude-sonnet-4 \
      --env modal

See src/harbor_adapter/README.md for detailed parity tracking. Key points:

  • Agent timeout, GPU access, evaluation: Full parity
  • Contamination judge: Parity
  • Agent duration: Tracked by Harbor in result.json
  • timer.sh: Minor difference (created at task generation vs job start)
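
On the timer.sh point, here is a minimal sketch of what such a script could look like (the 10-hour budget is mentioned later in the thread; the `START_FILE` path and the fallback behavior are illustrative assumptions, not PTB's actual layout):

```shell
# Hypothetical timer.sh sketch: report time remaining out of a 10-hour budget,
# measured from a start timestamp written when the agent actually launches.
BUDGET_SECONDS=$((10 * 60 * 60))
START_FILE="${START_FILE:-/tmp/agent_start_time}"

# A pre-agent hook should write this file; fall back to "now" if it is missing.
[ -f "$START_FILE" ] || date +%s > "$START_FILE"

start=$(cat "$START_FILE")
now=$(date +%s)
remaining=$((BUDGET_SECONDS - (now - start)))
[ "$remaining" -lt 0 ] && remaining=0

echo "remaining seconds: $remaining"
```

The "minor difference" above is exactly when the start timestamp gets written: at task generation time the clock starts too early, so the hook needs to run at job start instead.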

Note: For now I have skipped installing flash-attn in the container, because building it requires a CUDA runtime. On Modal the GPU is attached to the sandbox only after the container image is built, so the install can't happen at build time.
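
A sketch of what that deferred install could look like as a pre-agent step (the GPU probe and function name are illustrative; it assumes `uv` and the image's torch are already present):

```shell
# Hypothetical pre-agent step: flash-attn needs a CUDA toolkit to compile,
# which on Modal only exists once the GPU is attached to the sandbox.
install_flash_attn_if_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        # --no-build-isolation lets the build see the torch already in the image
        uv pip install --system --no-cache --no-build-isolation flash-attn
    else
        echo "no GPU attached; deferring flash-attn install"
        return 1
    fi
}

install_flash_attn_if_gpu || true
```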

Note: I have added a uv environment for us to use in PTB. It is needed for Modal and Harbor, and is useful in general for reproducibility.

Todos:

  • directly before agent is run, install flash_attn and build timer.sh
  • huggingface cache in a modal storage
  • before running evaluation, uninstall and reinstall major dependencies (like transformers, inspect-ai, ...). Make sure NOT to use the cache. (Alternatively, we can look into Docker-in-Docker.)
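
The reinstall todo could be sketched like this; the command is only printed here rather than executed, and the package list is abridged (`--reinstall` and `--no-cache` are real uv flags, the rest is illustrative):

```shell
# Hypothetical pre-eval step: restore eval dependencies to known-good versions,
# ignoring both the pip cache and anything the agent installed or upgraded.
# (--reinstall forces reinstallation even when the version already matches.)
eval_deps=(transformers inspect-ai lm-eval)   # abridged list
reinstall_cmd=(uv pip install --system --reinstall --no-cache "${eval_deps[@]}")

# In the harness this would be executed right before evaluation:
#   "${reinstall_cmd[@]}"
echo "${reinstall_cmd[*]}"
```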


@hrdkbhatnagar hrdkbhatnagar added the feature New feature or request label Feb 11, 2026
@hrdkbhatnagar hrdkbhatnagar added this to the V1 Release milestone Feb 11, 2026
Comment on lines +42 to +62
RUN uv pip install --system --no-cache \
accelerate \
boto3 \
bitsandbytes \
datasets \
evaluate \
lm-eval \
openai \
pandas \
scikit-learn \
shortuuid \
tokenizers \
transformers \
trl \
peft \
tiktoken \
inspect-ai \
matplotlib \
certifi

# Note: flash_attn requires GPU to compile - install at runtime if needed:
Collaborator Author

pin versions like the current images
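
One way to pin without hand-copying version strings (a sketch; assumes a constraints.txt exported from an existing PTB image with `uv pip freeze`):

```dockerfile
# Hypothetical: pin to the versions in the current PTB images by exporting
# them once (`uv pip freeze > constraints.txt` inside an existing container)
# and applying the file as a constraint here.
COPY constraints.txt /tmp/constraints.txt
RUN uv pip install --system --no-cache --constraint /tmp/constraints.txt \
    accelerate \
    datasets \
    transformers \
    trl \
    peft \
    inspect-ai
```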

@hrdkbhatnagar
Collaborator Author

Things that remain to get full parity with the original PTB implementation:

  1. Separate verifier container (eval integrity)
    In our setup, we evaluate the agent's post-trained model in a different container than the one the agent trained in. This prevents reward hacking: the agent could otherwise modify eval files in its workspace to inflate its score. In Harbor, the verifier runs in the same sandbox as the agent, so it uses whatever files the agent may have tampered with. We need a way to run the verifier in an isolated environment.

  2. Pre-agent shell command inside the container
    We have a timer script that the agent calls to check remaining time (out of 10 hours). It needs to know when the agent actually started. In our original setup, the host orchestrator writes the start timestamp before launching the agent container. In Harbor, I see lifecycle hooks on the Trial object (TrialEvent.AGENT_START etc.), but those run on the orchestrator side, not inside the sandbox. We need a way to execute a shell command inside the container right before the agent starts, like a pre-agent hook.

  3. Downloading additional directories after a run
    After the agent finishes, we'd like to download its full workspace (/home/agent/workspace/), including the code it wrote and the fine-tuned model weights. Currently Harbor only downloads /logs/agent and /logs/verifier.

@hrdkbhatnagar
Collaborator Author

After discussing with Alex from Harbor/tbench:

  1. We could put the verifier in the tests/ directory, which only gets uploaded after the agent runs.

  2. This is not yet supported, but we should look into it and potentially make a PR to Harbor.

  3. Artifact collection is now supported in Harbor, so we should use that.

