
Add few-shot prompting for base models and fix setting env vars #6

Open
hrdkbhatnagar wants to merge 7 commits into main from fewshot_base_models

Conversation

@hrdkbhatnagar
Collaborator

Changes

Few-shot Support for Base Models:

  • New utility module (src/eval/utils/fewshot_loader.py)
  • Few-shot example files (src/eval/fewshot_examples/):
    • aime2025.json - 3 examples from AIME 2024
    • gpqamain.json - 3 graduate-level science examples
  • Integrated into benchmarks (AIME2025, GPQA Main, not BFCL and HumanEval):
    • Added --fewshot flag: auto (base models only), always, or never
    • Added --num-fewshot to control number of examples
    • Few-shot examples prepended as system message
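The loader described above is not shown in this PR page; a minimal sketch of how it might work (function and field names are assumptions, not the actual fewshot_loader.py API) is:

```python
# Hypothetical sketch of a few-shot loader; the real fewshot_loader.py may differ.
import json
from pathlib import Path


def load_fewshot_examples(path, num_fewshot=3):
    """Load up to num_fewshot question/answer pairs from a JSON example file."""
    examples = json.loads(Path(path).read_text())
    return examples[:num_fewshot]


def build_fewshot_system_message(examples):
    """Format the examples into a single system message to prepend to the prompt."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in examples]
    return "Solve the following problems.\n\n" + "\n\n".join(parts)
```

The resulting string would then be sent as the first (system) message, ahead of the benchmark question.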

Sampling Parameters:

Added to all benchmarks (GSM8K, AIME2025, GPQA Main, HumanEval, BFCL):

  • --temperature (default: 0.6)
  • --top-p (default: 0.95)
  • --top-k (default: 20)
  • --epochs for multiple runs per benchmark, to report standard deviations
    (Defaults follow Qwen's recommendation to avoid greedy decoding for base models.)
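Wiring these flags up could look like the following sketch (flag names and defaults are from the PR description; the helper name and use of argparse are assumptions):

```python
# Hypothetical sketch: shared sampling/epoch flags for the benchmark scripts.
# Defaults follow Qwen's recommendation to avoid greedy decoding for base models.
import argparse


def add_sampling_args(parser):
    parser.add_argument("--temperature", type=float, default=0.6)
    parser.add_argument("--top-p", type=float, default=0.95, dest="top_p")
    parser.add_argument("--top-k", type=int, default=20, dest="top_k")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Runs per benchmark (for standard deviations)")
    return parser
```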

Qwen Base Model Template

  • Created src/eval/templates/qwen_base.jinja - simple template without chat tokens (<|im_start|>, etc.)
  • Improves base model performance by avoiding confusion from instruct-format tokens
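The actual contents of qwen_base.jinja are not shown in this PR; a minimal template in this spirit might simply render the raw prompt with no chat markup (the variable name below is an assumption):

```jinja
{# Sketch only: render the prompt directly, with no <|im_start|>/<|im_end|>
   chat tokens, so the base model sees plain completion-style input. #}
{{ prompt }}
```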

Eval Configuration Tracking

All benchmarks now save eval_config.json alongside metrics.json containing:

  • Model path and type (base/instruct)
  • Few-shot settings
  • Sampling parameters
  • Epochs
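Writing the config file could be as simple as the following sketch (the field names are assumptions based on the list above, not the PR's exact schema):

```python
# Hypothetical sketch: save eval_config.json next to metrics.json.
import json
from pathlib import Path


def save_eval_config(result_dir, model_path, model_type, fewshot, num_fewshot,
                     temperature, top_p, top_k, epochs):
    config = {
        "model_path": model_path,
        "model_type": model_type,   # "base" or "instruct"
        "fewshot": fewshot,         # "auto", "always", or "never"
        "num_fewshot": num_fewshot,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "epochs": epochs,
    }
    out = Path(result_dir) / "eval_config.json"
    out.write_text(json.dumps(config, indent=2))
    return out
```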

Env variable setting fixes

  • Move sourcing of set_env_vars.sh to the top of run_baseline.sh
  • Make RESULT_DIR absolute for apptainer --bind
  • Remove POST_TRAIN_BENCH_* vars from baseline_cluster.sub (they caused UNDEFINED in jobs)
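The first two fixes could look like this sketch (file names are from the PR; the surrounding script logic is assumed):

```shell
#!/bin/bash
# Hypothetical sketch of the run_baseline.sh fixes.

# 1. Source env vars at the top, before anything below uses them.
#    (Guarded here so the sketch runs standalone; the real script
#    would source it unconditionally.)
[ -f set_env_vars.sh ] && source set_env_vars.sh

# 2. Make RESULT_DIR absolute so apptainer --bind gets a resolvable path.
RESULT_DIR="$(realpath "${RESULT_DIR:-results}")"
echo "RESULT_DIR=$RESULT_DIR"
```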

Testing

  • Tested baseline evaluation with Qwen/Qwen3-1.7B-Base on all the tasks
  • Verified env vars are correctly set on both login node (zsh) and compute node (bash)

  - Few-shot support for AIME2025, GPQA Main, GSM8K (not BFCL/HumanEval)
  - Add temperature, top_p, top_k, epochs args to all benchmarks
  - Fix baseline pipeline path issues
@hrdkbhatnagar
Collaborator Author

Posting here for future reference, after more discussion:

Let's postpone merging this PR into main until the paper deadline. We need to change a few more things that could conflict with the main logic:

  • Right now the sampling parameters are not model specific (fine for the base model). However, since agents will currently also get them, we need to change this so that the benchmark stays true to the original setting.
  • Similarly, the agent should not get the new base template (for Qwen models).
  • The env var changes will be merged into main, as they are working fine and independent of any other logic.

@rank-and-file
Collaborator

In general, I think for now we can let this live in a distinct branch. I don't think that we need to merge this right now.

But if we want to merge this at some point, I think it requires some rework. To add to the above, there are the following points:

  • Currently the evaluate.py scripts are being changed. This breaks parity, as the agents get those same scripts. To avoid this, we could add a new templates folder alongside src/eval/templates/, e.g. src/eval/templates_few_shot/, with the few-shot examples already pre-filled into the templates. That would let us leave evaluate.py unchanged and just use the existing --templates-dir option.
  • The same holds for temperature etc., but it is trickier to change those parameters without touching evaluate.py. One option is to store the model in a local folder, update its generation_config.json appropriately, and then evaluate that model. This would also let us avoid changing evaluate.py. In general, we don't want the agent to be able to change these values via command-line arguments, as we cannot replicate such changes in the final evaluation. But the agent can, of course, change generation_config.json.
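The generation_config.json route described above could be sketched as follows (a sketch assuming the Hugging Face convention of a generation_config.json file in the model folder; the helper name is hypothetical):

```python
# Hypothetical sketch: bake sampling parameters into a local model copy's
# generation_config.json instead of passing them as CLI flags.
import json
from pathlib import Path


def set_generation_config(model_dir, temperature=0.6, top_p=0.95, top_k=20):
    path = Path(model_dir) / "generation_config.json"
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update({
        "do_sample": True,   # avoid greedy decoding
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    })
    path.write_text(json.dumps(config, indent=2))
    return config
```

The benchmark would then load the model from model_dir, picking up these sampling settings without any evaluate.py changes.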

