**luxinyu1** requested changes on Jul 18, 2025
> In `eval_cot.yaml`, set `eval_type` to `cot_eval` for reasoning-model evaluation, `per_model_gpu` to define the number of GPUs per model worker, `dp_gpu` to configure data parallelism, `n_samples` to specify the number of samples in evaluation, and `max_length` to set the maximum response length.
>
> The results for AIME24 and AIME25 will be stored in `outputs/{data_name}/{model_name}_{split}_{prompt_type}_{num_test_sample}_seed{seed}_t{temperature}_s{start}_e{end}_{prompt_type}_metrics.json`.
**Collaborator:** All outputs should use `./outputs/{model_name}` as the base path.
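A minimal sketch of the `eval_cot.yaml` settings described above. The key names follow the description; all values are illustrative and should be adapted to the actual hardware and benchmark:

```yaml
eval_type: cot_eval    # reasoning-model (chain-of-thought) evaluation
per_model_gpu: 1       # GPUs per model worker
dp_gpu: 4              # data-parallel degree
n_samples: 16          # number of samples in evaluation
max_length: 32768      # maximum response length
```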
> ```
>     'ms_id': '',
>     'hf_id': '',
>     'local': './data/mbpp_plus/mbpp_plus.jsonl',
> }
> ```
> 2. The `score` method in the `MBPPEvaluator` class for MBPP evaluation is incompatible with the `evalplus` library. The call to `evalplus`'s `evaluate` function needs to change from `self.eval(flags)` to `self.eval(**flags)`; the flags `base_only`, `i_just_wanna_run`, and `mini` should be set to `False`; and `evaluate`, which originally returned nothing, should be modified to return the `pass_at_k` results.
**Collaborator:** Could a monkey patch be used instead? This description is rather vague.
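The reviewer's monkey-patch suggestion can be sketched as follows. This is illustrative only: `evalplus_evaluate` below is a hypothetical stand-in module, and the real `evalplus` entry point, flag handling, and return shape may differ.

```python
import types

# Stand-in for the evalplus.evaluate module (hypothetical; in the real
# codebase this would be `import evalplus.evaluate`).
evalplus_evaluate = types.ModuleType("evalplus.evaluate")

def patched_evaluate(**flags):
    """Accept keyword flags, pin the defaults the text requires, and
    return the pass@k results instead of nothing."""
    flags.setdefault("base_only", False)
    flags.setdefault("i_just_wanna_run", False)
    flags.setdefault("mini", False)
    # ... the original evaluation logic would run here ...
    pass_at_k = {"pass@1": 0.0}  # placeholder result
    return pass_at_k

# Install the patch once, before MBPPEvaluator.score is called, so the
# vendored evalplus source never needs to be edited in place.
evalplus_evaluate.evaluate = patched_evaluate
```

With a patch like this applied at import time, `MBPPEvaluator.score` can call `evaluate(**flags)` without the library itself being modified.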
> 3. If you need to evaluate a new model in LiveCodeBench, modify `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/lm_styles.py`:
>    - To add a model with a new LMStyle, first add the style to the `LMStyle` class, then add the model's `LanguageModel` entry to `LanguageModelList`.
>    - To add a model under an existing LMStyle, add the model's `LanguageModel` entry to `LanguageModelList` directly.
> 4. Because a reasoning model may perform differently under few-shot and zero-shot settings, the evaluation templates can be modified as follows:
>    For the AIME24 and AIME25 evaluation templates, modify `PROMPT_TEMPLATES` in `AutoAlign/src/autoalign/eval/math_eval/utils.py`.
**Collaborator:** There is not much room to change the AIME24/25 template; just fix it to "Please reason step by step, and put your final answer within \boxed{{}}."
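The registration procedure from step 3 can be sketched as below. The class and field names mirror the description above but are assumptions about `lm_styles.py`, not its actual contents, and the model names are placeholders:

```python
from dataclasses import dataclass
from enum import Enum

class LMStyle(Enum):
    OpenAIChat = "OpenAIChat"    # pre-existing style (illustrative)
    MyNewStyle = "MyNewStyle"    # step 1: add the new style here

@dataclass(frozen=True)
class LanguageModel:
    model_name: str      # e.g. the HF repo id
    model_repr: str      # short display name
    model_style: LMStyle

LanguageModelList = [
    LanguageModel("existing-org/existing-model", "Existing-Model", LMStyle.OpenAIChat),
    # step 2: register the model under the new (or an existing) style
    LanguageModel("my-org/my-new-model", "My-New-Model", LMStyle.MyNewStyle),
]
```

A model reusing an existing style skips the enum change and only appends its `LanguageModel` entry to the list.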
>    For the LiveCodeBench evaluation templates, modify the `format_prompt_generation` function in `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/prompts/code_generation.py`.
>    For benchmarks evaluated through OpenCompass, locate the dataset's Python config under `AutoAlign/opencompass/opencompass/configs/datasets` and modify the `template` in `{data_name}_infer_cfg`.
**Collaborator:** Could this follow @xudong2001's earlier approach and read the bos/eos tokens from our `conversation.py`?
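Pinning the AIME24/25 template the way the reviewer requests might look like the sketch below. The tuple layout of `PROMPT_TEMPLATES` entries is an assumption for illustration; only the instruction string itself comes from the comment above:

```python
# Fixed chain-of-thought instruction from the review comment; the doubled
# braces survive str.format() as a literal \boxed{}.
FIXED_COT_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

# Hypothetical entry layout: (user template, assistant template, separator).
PROMPT_TEMPLATES = {
    "cot": (
        "{input}\n" + FIXED_COT_INSTRUCTION,
        "{output}",
        "\n\n",
    ),
}

# Rendering a question with the fixed template:
prompt = PROMPT_TEMPLATES["cot"][0].format(input="What is 2+2?")
```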
**luxinyu1** reviewed on Jul 18, 2025
> Evaluation reference results:
>
> | Model | AIME24 | AIME25 | LCB | MBPP+ | GPQA | IFEval Prompt-level-strict-accuracy | IFEval Inst-level-strict-accuracy | IFEval Prompt-level-loose-accuracy | IFEval Inst-level-loose-accuracy |
> |--------------|--------|--------|------|-------|-------|-------|-------|-------|-------|
> | distill-1.5b | 27.3 | 21.6 | 17.2 | 1.06 | 27.27 | 26.99 | 41.13 | 28.10 | 42.81 |
**Collaborator:** "distill-1.5b" is too vague; use the official model name. Also, why is MBPP+ only 1.06?
Removed redundant files from `eval`.