
Add few-shot prompting for base models and fix setting env vars #6

Open
hrdkbhatnagar wants to merge 7 commits into main from fewshot_base_models

Conversation

@hrdkbhatnagar
Collaborator

Changes

Few-shot Support for Base Models:

  • New utility module (src/eval/utils/fewshot_loader.py)
  • Few-shot example files (src/eval/fewshot_examples/):
    • aime2025.json - 3 examples from AIME 2024
    • gpqamain.json - 3 graduate-level science examples
  • Integrated into benchmarks (AIME2025, GPQA Main, not BFCL and HumanEval):
    • Added --fewshot flag: auto (base models only), always, or never
    • Added --num-fewshot to control number of examples
    • Few-shot examples prepended as system message
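The loader described above is not shown in this PR page; a minimal sketch of how it might work (function and field names are assumptions, not the actual fewshot_loader.py API) is:

```python
# Hypothetical sketch of a few-shot loader; the real fewshot_loader.py may differ.
import json
from pathlib import Path


def load_fewshot_examples(path, num_fewshot=3):
    """Load up to num_fewshot question/answer pairs from a JSON example file."""
    examples = json.loads(Path(path).read_text())
    return examples[:num_fewshot]


def build_fewshot_system_message(examples):
    """Format the examples into a single system message to prepend to the prompt."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in examples]
    return "Solve the following problems.\n\n" + "\n\n".join(parts)
```

The resulting string would then be sent as the first (system) message, ahead of the benchmark question.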

Sampling Parameters:

Added to all benchmarks (GSM8K, AIME2025, GPQA Main, HumanEval, BFCL):

  • --temperature (default: 0.6)
  • --top-p (default: 0.95)
  • --top-k (default: 20)
  • --epochs for multiple runs per benchmark, to report standard deviations
    (Defaults follow Qwen's recommendation to avoid greedy decoding for base models.)
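Wiring these flags up could look like the following sketch (flag names and defaults are from the PR description; the helper name and use of argparse are assumptions):

```python
# Hypothetical sketch: shared sampling/epoch flags for the benchmark scripts.
# Defaults follow Qwen's recommendation to avoid greedy decoding for base models.
import argparse


def add_sampling_args(parser):
    parser.add_argument("--temperature", type=float, default=0.6)
    parser.add_argument("--top-p", type=float, default=0.95, dest="top_p")
    parser.add_argument("--top-k", type=int, default=20, dest="top_k")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Runs per benchmark (for standard deviations)")
    return parser
```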

Qwen Base Model Template

  • Created src/eval/templates/qwen_base.jinja - simple template without chat tokens (<|im_start|>, etc.)
  • Improves base model performance by avoiding confusion from instruct-format tokens
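The actual contents of qwen_base.jinja are not shown in this PR; a minimal template in this spirit might simply render the raw prompt with no chat markup (the variable name below is an assumption):

```jinja
{# Sketch only: render the prompt directly, with no <|im_start|>/<|im_end|>
   chat tokens, so the base model sees plain completion-style input. #}
{{ prompt }}
```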

Eval Configuration Tracking

All benchmarks now save eval_config.json alongside metrics.json containing:

  • Model path and type (base/instruct)
  • Few-shot settings
  • Sampling parameters
  • Epochs
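Writing the config file could be as simple as the following sketch (the field names are assumptions based on the list above, not the PR's exact schema):

```python
# Hypothetical sketch: save eval_config.json next to metrics.json.
import json
from pathlib import Path


def save_eval_config(result_dir, model_path, model_type, fewshot, num_fewshot,
                     temperature, top_p, top_k, epochs):
    config = {
        "model_path": model_path,
        "model_type": model_type,   # "base" or "instruct"
        "fewshot": fewshot,         # "auto", "always", or "never"
        "num_fewshot": num_fewshot,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "epochs": epochs,
    }
    out = Path(result_dir) / "eval_config.json"
    out.write_text(json.dumps(config, indent=2))
    return out
```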

Env variable setting fixes

  • Move sourcing of set_env_vars.sh to the top of run_baseline.sh
  • Make RESULT_DIR absolute for apptainer --bind
  • Remove POST_TRAIN_BENCH_* vars from baseline_cluster.sub (they caused UNDEFINED in jobs)
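The first two fixes could look like this sketch (file names are from the PR; the surrounding script logic is assumed):

```shell
#!/bin/bash
# Hypothetical sketch of the run_baseline.sh fixes.

# 1. Source env vars at the top, before anything below uses them.
#    (Guarded here so the sketch runs standalone; the real script
#    would source it unconditionally.)
[ -f set_env_vars.sh ] && source set_env_vars.sh

# 2. Make RESULT_DIR absolute so apptainer --bind gets a resolvable path.
RESULT_DIR="$(realpath "${RESULT_DIR:-results}")"
echo "RESULT_DIR=$RESULT_DIR"
```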

Testing

  • Tested baseline evaluation with Qwen/Qwen3-1.7B-Base on all the tasks
  • Verified env vars are correctly set on both login node (zsh) and compute node (bash)

  - Few-shot support for AIME2025, GPQA Main, GSM8K (not BFCL/HumanEval)
  - Add temperature, top_p, top_k, epochs args to all benchmarks
  - Fix baseline pipeline path issues
@hrdkbhatnagar
Collaborator Author

Posting here for future reference, after more discussion:

Let's postpone merging this PR into main until the paper deadline. We need to change a few more things that could conflict with the main logic:

  • Right now the sampling parameters are not model specific (fine for the base model). However, since agents will currently also get them, we need to change this so that the benchmark stays true to the original setting.
  • Similarly, the agent should not get the new base template (for Qwen models).
  • The env var changes will be merged into main, as they are working fine and independent of any other logic.

@rank-and-file
Collaborator

In general, I think for now we can let this live in a distinct branch. I don't think that we need to merge this right now.

But if we want to merge this at some point, I think it requires some rework. To add to the above, there are the following points:

  • Currently the evaluate.py scripts are being changed. This breaks parity, as the agents get those same scripts. To avoid this, we could add a new templates folder alongside src/eval/templates/, e.g. src/eval/templates_few_shot/, with the few-shot examples already pre-filled into the templates. That would let us leave evaluate.py unchanged and just use the existing --templates-dir option.
  • The same holds for temperature etc., but it is trickier to change those parameters without touching evaluate.py. One option is to store the model in a local folder, update its generation_config.json appropriately, and then evaluate that model. This would also let us avoid changing evaluate.py. In general, we don't want the agent to be able to change these values via command-line arguments, as we cannot replicate such changes in the final evaluation. But the agent can, of course, change generation_config.json.
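The generation_config.json route described above could be sketched as follows (a sketch assuming the Hugging Face convention of a generation_config.json file in the model folder; the helper name is hypothetical):

```python
# Hypothetical sketch: bake sampling parameters into a local model copy's
# generation_config.json instead of passing them as CLI flags.
import json
from pathlib import Path


def set_generation_config(model_dir, temperature=0.6, top_p=0.95, top_k=20):
    path = Path(model_dir) / "generation_config.json"
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update({
        "do_sample": True,   # avoid greedy decoding
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    })
    path.write_text(json.dumps(config, indent=2))
    return config
```

The benchmark would then load the model from model_dir, picking up these sampling settings without any evaluate.py changes.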

