
Add Harbor Framework Support #8

Open
hrdkbhatnagar wants to merge 5 commits into main from add_harbor_support

Conversation

@hrdkbhatnagar
Collaborator

@hrdkbhatnagar hrdkbhatnagar commented Jan 13, 2026

Adds Harbor framework support to PostTrainBench, enabling anyone to run our benchmark on cloud GPUs (Modal, Daytona) without needing access to our internal HTCondor cluster.

At the moment:

  • Generate Harbor-compatible task directories from PostTrainBench benchmarks
  • Almost full parity with the original pipeline

Tested:

  • Generated task for gsm8k + qwen3-1.7b
  • Ran a 1-hour test with Claude Code (Sonnet 4) on Modal
  • Verified end-to-end pipeline (including eval + contam judge)
  • Confirmed accuracy metrics extracted correctly

Usage

  cd src/harbor_adapter
  uv sync

  # Generate a task
  python run_adapter.py --benchmark gsm8k --model qwen3-1.7b --output ./tasks

  # Run with Harbor
  harbor run \
      --path ./tasks/posttrainbench-gsm8k-qwen3-1.7b \
      --agent claude-code \
      --model anthropic/claude-sonnet-4 \
      --env modal

See src/harbor_adapter/README.md for detailed parity tracking. Key points:

  • Agent timeout, GPU access, evaluation: Full parity
  • Contamination judge: Parity
  • Agent duration: Tracked by Harbor in result.json
  • timer.sh: Minor difference (created at task generation vs job start)
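
On the timer.sh point, here is a minimal sketch of what such a script could look like (the 10-hour budget is mentioned later in the thread; the `START_FILE` path and the fallback behavior are illustrative assumptions, not PTB's actual layout):

```shell
# Hypothetical timer.sh sketch: report time remaining out of a 10-hour budget,
# measured from a start timestamp written when the agent actually launches.
BUDGET_SECONDS=$((10 * 60 * 60))
START_FILE="${START_FILE:-/tmp/agent_start_time}"

# A pre-agent hook should write this file; fall back to "now" if it is missing.
[ -f "$START_FILE" ] || date +%s > "$START_FILE"

start=$(cat "$START_FILE")
now=$(date +%s)
remaining=$((BUDGET_SECONDS - (now - start)))
[ "$remaining" -lt 0 ] && remaining=0

echo "remaining seconds: $remaining"
```

The "minor difference" above is exactly when the start timestamp gets written: at task generation time the clock starts too early, so the hook needs to run at job start instead.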

Note: For now I have skipped installing flash-attn in the container, because building it requires a CUDA runtime. On Modal the GPU is attached to the sandbox only after the container image is built, so the install can't happen at build time.
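
A sketch of what that deferred install could look like as a pre-agent step (the GPU probe and function name are illustrative; it assumes `uv` and the image's torch are already present):

```shell
# Hypothetical pre-agent step: flash-attn needs a CUDA toolkit to compile,
# which on Modal only exists once the GPU is attached to the sandbox.
install_flash_attn_if_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        # --no-build-isolation lets the build see the torch already in the image
        uv pip install --system --no-cache --no-build-isolation flash-attn
    else
        echo "no GPU attached; deferring flash-attn install"
        return 1
    fi
}

install_flash_attn_if_gpu || true
```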

Note: I have added a uv environment for us to use in PTB. It is needed for Modal and Harbor, and is useful in general for reproducibility.

Todos:

  • directly before agent is run, install flash_attn and build timer.sh
  • huggingface cache in a modal storage
  • before running evaluation, uninstall and reinstall major dependencies (like transformers, inspect-ai, ...). Make sure NOT to use the cache. (Alternatively, we can look into Docker-in-Docker.)
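
The reinstall todo could be sketched like this; the command is only printed here rather than executed, and the package list is abridged (`--reinstall` and `--no-cache` are real uv flags, the rest is illustrative):

```shell
# Hypothetical pre-eval step: restore eval dependencies to known-good versions,
# ignoring both the pip cache and anything the agent installed or upgraded.
# (--reinstall forces reinstallation even when the version already matches.)
eval_deps=(transformers inspect-ai lm-eval)   # abridged list
reinstall_cmd=(uv pip install --system --reinstall --no-cache "${eval_deps[@]}")

# In the harness this would be executed right before evaluation:
#   "${reinstall_cmd[@]}"
echo "${reinstall_cmd[*]}"
```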


@hrdkbhatnagar hrdkbhatnagar added the feature New feature or request label Feb 11, 2026
@hrdkbhatnagar hrdkbhatnagar added this to the V1 Release milestone Feb 11, 2026
Comment on lines +42 to +62
RUN uv pip install --system --no-cache \
accelerate \
boto3 \
bitsandbytes \
datasets \
evaluate \
lm-eval \
openai \
pandas \
scikit-learn \
shortuuid \
tokenizers \
transformers \
trl \
peft \
tiktoken \
inspect-ai \
matplotlib \
certifi

# Note: flash_attn requires GPU to compile - install at runtime if needed:
Collaborator Author

pin versions like the current images
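
One way to pin without hand-copying version strings (a sketch; assumes a constraints.txt exported from an existing PTB image with `uv pip freeze`):

```dockerfile
# Hypothetical: pin to the versions in the current PTB images by exporting
# them once (`uv pip freeze > constraints.txt` inside an existing container)
# and applying the file as a constraint here.
COPY constraints.txt /tmp/constraints.txt
RUN uv pip install --system --no-cache --constraint /tmp/constraints.txt \
    accelerate \
    datasets \
    transformers \
    trl \
    peft \
    inspect-ai
```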

@hrdkbhatnagar
Collaborator Author

Things that remain to get full parity with the original PTB implementation:

  1. Separate verifier container (eval integrity)
    In our setup, we evaluate the agent's post-trained model in a different container than the one the agent trained in. This prevents reward hacking: the agent could otherwise modify eval files in its workspace to inflate its score. In Harbor, the verifier runs in the same sandbox as the agent, so it uses whatever files the agent may have tampered with. We need a way to run the verifier in an isolated environment.

  2. Pre-agent shell command inside the container
    We have a timer script that the agent calls to check remaining time (out of 10 hours). It needs to know when the agent actually started. In our original setup, the host orchestrator writes the start timestamp before launching the agent container. In Harbor, I see lifecycle hooks on the Trial object (TrialEvent.AGENT_START etc.), but those run on the orchestrator side, not inside the sandbox. We need a way to execute a shell command inside the container right before the agent starts, like a pre-agent hook.

  3. Downloading additional directories after a run
    After the agent finishes, we'd like to download its full workspace (/home/agent/workspace/), including the code it wrote and the fine-tuned model weights. Currently Harbor only downloads /logs/agent and /logs/verifier.

@hrdkbhatnagar
Collaborator Author

After discussing with Alex from Harbor/tbench:

  1. We could put the verifier in the tests/ directory, which only gets uploaded after the agent runs.

  2. This is not yet supported, but we should look into it and potentially make a PR to Harbor.

  3. Artifact collection is now supported in Harbor, so we should use that.

