
Cot eval #45

Draft

pengbo121 wants to merge 3 commits into icip-cas:main from pengbo121:cot_eval

Conversation

@pengbo121 (Collaborator)

Removed redundant files in `eval`.

@luxinyu1 left a comment (Collaborator)

Changes to the template section. Principles:

  • If the upstream code does not support mainstream models such as qwen3 and llama3+, add support for them.
  • Automate, or apply as patches, any modifications to dependency code wherever possible.


In `eval_cot.yaml`, set `eval_type` to `cot_eval` for Reasoning model evaluation, `per_model_gpu` to define the number of GPUs per model worker, `dp_gpu` to configure data parallelism, `n_samples` to specify the number of samples in evaluation, and `max_length` to set the maximum response length.
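Those keys can be sketched as a minimal `eval_cot.yaml`; the values below are illustrative assumptions, not recommended settings:

```yaml
eval_type: cot_eval   # reasoning-model (CoT) evaluation
per_model_gpu: 1      # GPUs per model worker
dp_gpu: 4             # data-parallelism degree
n_samples: 8          # samples per problem in evaluation
max_length: 32768     # maximum response length
```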

The results for AIME24 and AIME25 will be stored in `outputs/{data_name}/{model_name}_{split}_{prompt_type}_{num_test_sample}_seed{seed}_t{temperature}_s{start}_e{end}_{prompt_type}_metrics.json`.
Collaborator

All outputs should use `./outputs/{model_name}` as the base path.

```python
'ms_id': '',
'hf_id': '',
'local': './data/mbpp_plus/mbpp_plus.jsonl',
}
```
Collaborator

Could this process be automated?

```python
'local': './data/mbpp_plus/mbpp_plus.jsonl',
}
```
2. The `score` method in the MBPPEvaluator class for MBPP evaluation is incompatible with the `evalplus` library. The input to the `evaluate` function in the `evalplus` library needs to be modified from `self.eval(flags)` to `self.eval(**flags)`. Additionally, the parameters `base_only`, `i_just_wanna_run`, and `mini` in `flags` should be set to `False`. The output of the `evaluate` function also needs modification. Originally, it had no output; now, it should return the `pass_at_k` results.
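One way to apply these changes without editing the library in place is a monkey patch that wraps `evaluate` at import time. The sketch below uses a stand-in namespace instead of the real `evalplus` module so the pattern is self-contained, and the `pass_at_k` location is a guess; treat every name here as an assumption:

```python
import types

# Stand-in for the evalplus.evaluate module, so the patching pattern is
# runnable here; in real code you would patch the imported module instead.
fake_evalplus = types.SimpleNamespace()

def _original_evaluate(flags):
    # Original behaviour described above: consumes flags, returns nothing.
    return None

fake_evalplus.evaluate = _original_evaluate

def patch_evaluate(module):
    """Wrap module.evaluate: force the required flags and return pass@k."""
    original = module.evaluate

    def evaluate(flags):
        # Settings the description above requires.
        flags["base_only"] = False
        flags["i_just_wanna_run"] = False
        flags["mini"] = False
        original(flags)
        # Hypothetical: surface the pass@k results instead of returning None.
        return flags.get("pass_at_k")

    module.evaluate = evaluate

patch_evaluate(fake_evalplus)
result = fake_evalplus.evaluate({"pass_at_k": {"pass@1": 0.5}})
```

Applying the patch once at startup keeps `MBPPEvaluator.score` and the installed `evalplus` sources untouched.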
Collaborator

Could we use a monkey patch here? This description is rather vague.

3. If you need to evaluate a new model in LiveCodeBench, you should modify `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/lm_styles.py`:
Collaborator

Support the commonly used models first, or merge support for them from upstream.

- To add a model with a new LMStyle, first add the style to the LMStyle class, then add a LanguageModel entry for the model to LanguageModelList.
- To add another model under an existing LMStyle, add a LanguageModel entry for it to LanguageModelList directly.
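The two cases above can be illustrated with a simplified mock of the `lm_styles.py` structure; the class and field names here are assumptions, so check the real file for the exact dataclass fields:

```python
from dataclasses import dataclass
from enum import Enum

# Simplified mock of lm_styles.py; the real LiveCodeBench fields may differ.
class LMStyle(Enum):
    OpenAIChat = "OpenAIChat"
    MyNewStyle = "MyNewStyle"  # case 1: register the new style here

@dataclass
class LanguageModel:
    model_name: str  # HF repo or API identifier (hypothetical values below)
    model_style: LMStyle

LanguageModelList = [
    LanguageModel("gpt-4o", LMStyle.OpenAIChat),
    # case 1 (cont.): add the model's entry under the new style
    LanguageModel("my-org/my-reasoning-model", LMStyle.MyNewStyle),
    # case 2: another model under an existing style only needs a new entry
    LanguageModel("my-org/my-chat-model", LMStyle.OpenAIChat),
]
```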
4. Due to the potentially varying performance of the reasoning model under few-shot and zero-shot settings, to modify the evaluation templates, you can follow these steps:
For AIME24 and AIME25 evaluation templates, modify `PROMPT_TEMPLATES` in `AutoAlign/src/autoalign/eval/math_eval/utils.py`.
Collaborator

The AIME24/25 templates leave little room for change; just use "Please reason step by step, and put your final answer within \boxed{{}}." and keep it fixed.

For LiveCodeBench evaluation templates, modify the `format_prompt_generation` function in `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/prompts/code_generation.py`.
For templates evaluated in OpenCompass, locate the respective Python file for each dataset in `AutoAlign/opencompass/opencompass/configs/datasets` and modify the `template` in `{data_name}_infer_cfg`.
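As a sketch, a zero-shot CoT entry in `PROMPT_TEMPLATES` could look like the following. The `(input, output, separator)` tuple layout is an assumption about the conventions in `math_eval/utils.py`; the instruction string is the fixed one used for AIME24/25:

```python
# Hypothetical sketch of a PROMPT_TEMPLATES entry in
# src/autoalign/eval/math_eval/utils.py.
PROMPT_TEMPLATES = {
    "cot": (
        "{input}\nPlease reason step by step, and put your final answer "
        "within \\boxed{{}}.",   # question template with CoT instruction
        "{output}",              # response template
        "\n\n",                  # separator between examples
    ),
}

input_tmpl, output_tmpl, sep = PROMPT_TEMPLATES["cot"]
prompt = input_tmpl.format(input="What is 1 + 1?")
```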
Collaborator

Could this part follow @xudong2001's earlier approach and read the bos/eos from our conversation.py?

Evaluation Reference Results:
| Model | AIME24 | AIME25 | LCB | MBPP+ | GPQA | IFEval Prompt-level-strict-accuracy | IFEval Inst-level-strict-accuracy | IFEval Prompt-level-loose-accuracy | IFEval Inst-level-loose-accuracy |
|-------------|--------|--------|------|-------|-------|-------------------------------------|------------------------------------|-------------------------------------|----------------------------------|
| distill-1.5b | 27.3 | 21.6 | 17.2 | 1.06 | 27.27 | 26.99 | 41.13 | 28.10 | 42.81 |
Collaborator

"distill-1.5b" is too vague; use the official model name. Also, why is MBPP+ only 1.06?

@luxinyu1 luxinyu1 marked this pull request as draft July 21, 2025 06:04