[WIP] Fix redundant reruns in npu validation scripts #306

Open
Zhendong404 wants to merge 1 commit into hw-native-sys:main from Zhendong404:fix-npu-validation-npu-rerun
Conversation

@Zhendong404 Zhendong404 commented Mar 19, 2026

Summary

  • split the old GOLDEN_MODE=npu behavior into two explicit modes:
    • npu: run the NPU executable once and compare against pre-generated golden_*.bin
    • npu_consistency: run the NPU executable twice and compare the two runs for consistency
  • change the default GOLDEN_MODE to npu
  • add explicit golden-file checks for GOLDEN_MODE=npu and surface clear guidance when callers should use npu_consistency instead
  • align the generated run.sh template and the remote batch runner with the same mode semantics
  • load validation_meta.env in generated run.sh so custom golden handling matches the remote runner
  • avoid overwriting sample-provided golden outputs in sim mode when CUSTOM_GOLDEN=1
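
The mode split described above can be sketched roughly as follows. This is an illustrative outline only, not the PR's actual code: `has_golden_files` and `plan_validation` are hypothetical names, and the printed plan strings stand in for the real run and compare steps.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of GOLDEN_MODE dispatch; names are illustrative.

# Return 0 if at least one golden_*.bin exists in the given directory.
has_golden_files() {
  local dir="$1"
  compgen -G "${dir}/golden_*.bin" > /dev/null
}

# Print what a run would do for the given mode and case directory.
# Returns 2 when the reference is missing or the mode is unknown.
plan_validation() {
  local mode="${1:-npu}" dir="${2:-.}"
  case "${mode}" in
    npu)
      # One NPU run, compared against pre-generated golden files.
      if ! has_golden_files "${dir}"; then
        echo "error: no golden_*.bin in ${dir}; consider GOLDEN_MODE=npu_consistency" >&2
        return 2
      fi
      echo "run-once compare-with-golden"
      ;;
    npu_consistency)
      # Two NPU runs, compared against each other; no golden files needed.
      echo "run-twice compare-runs"
      ;;
    *)
      echo "error: unsupported GOLDEN_MODE=${mode}" >&2
      return 2
      ;;
  esac
}
```

Under these assumptions, `plan_validation "${GOLDEN_MODE:-npu}" "${case_dir}"` fails early with a hint when `npu` is selected without any reference files, instead of silently rerunning the executable.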

Testing

  • bash -n test/npu_validation/templates/run_sh_template.sh
  • bash -n test/npu_validation/scripts/run_remote_npu_validation.sh

Fixes #305.

Depends on prerequisite PR #251

@Zhendong404 Zhendong404 changed the title from "Fix redundant reruns in npu validation scripts" to "[WIP] Fix redundant reruns in npu validation scripts" Mar 19, 2026
@Zhendong404 Zhendong404 force-pushed the fix-npu-validation-npu-rerun branch from d054353 to cb0de6a March 19, 2026 02:09
@Zhendong404 Zhendong404 changed the title from "[WIP] Fix redundant reruns in npu validation scripts" to "Fix redundant reruns in npu validation scripts" Mar 19, 2026

Zhendong404 commented Mar 19, 2026

Depends on prerequisite PR #251

@Zhendong404 Zhendong404 force-pushed the fix-npu-validation-npu-rerun branch from cb0de6a to adc94d4 March 30, 2026 08:59
@Zhendong404 Zhendong404 force-pushed the fix-npu-validation-npu-rerun branch from adc94d4 to f352b64 March 31, 2026 12:52
@reedhecre

Codex Review

This comment is updated automatically by the review bot.

  • PR: #306 Fix redundant reruns in npu validation scripts
  • Author: Zhendong404
  • Base/Head: main / fix-npu-validation-npu-rerun
  • Head SHA: f352b6401be3
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-03-31T13:05:07Z
  • Previous Head SHA: adc94d467e07
  • Status: completed

Summary

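The newly added npu_precision mode is unusable for both remotely generated test cases and the TInsert board validation, and the CI entry point does not pass the mode to the remote script.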

Findings

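  1. P2 `npu_precision` in remote validation cannot run generated cases on a clean payload test/npu_validation/scripts/run_remote_npu_validation.sh:349

The new branch assumes that the current golden.py run produces golden_*.bin and exits immediately otherwise. However, for cases without a custom golden, generate_testcase.py still generates a golden.py based on test/npu_validation/templates/golden_template.py, and that template only writes input files; it never writes any golden_*.bin. The *-pto.cpp case that run_remote_npu_validation.sh actually discovers in this repository is test/samples/Matmul_transpose/Matmul_transpose-pto.cpp, which has no custom golden assets either. As a result, setting GOLDEN_MODE to npu_precision exits with code 2 at has_reference_golden_outputs, and remote validation cannot run a single case.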
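
One conceivable way to keep a batch run alive when some cases lack reference outputs is to classify cases up front rather than exit on the first miss. The sketch below is only an illustration of that idea, not the script's current behavior or a proposed patch; `classify_cases` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide per case directory whether a golden comparison
# is possible, instead of aborting the whole run with exit 2.
classify_cases() {
  local case_dir
  for case_dir in "$@"; do
    if compgen -G "${case_dir}/golden_*.bin" > /dev/null; then
      echo "compare ${case_dir}"
    else
      # No reference outputs: report the case rather than failing the batch.
      echo "skip ${case_dir}"
    fi
  done
}
```

Whether skipping is acceptable, or whether such cases should instead fall back to a different mode, is a policy decision for the PR author.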

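  2. P2 The new `npu_precision` path for `TInsert` has no way to generate reference golden files test/samples/TInsert/board_validation/run.sh:153

test/samples/TInsert/board_validation/golden.py only generates v1.bin and v4.bin; it does not generate golden_v4.bin. The new npu_precision branch therefore always fails in a clean directory. If a stale golden_v4.bin is left over in the directory, has_reference_golden_outputs treats that old file as the current reference and the comparison proceeds, so the result depends on the history of previous runs. In other words, the new mode is broken on this sample and its behavior is unstable.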
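
The stale-file hazard described above could be avoided by clearing leftover reference outputs before each run, unless the sample ships its own. The following is a minimal sketch under two assumptions: that CUSTOM_GOLDEN=1 marks sample-provided golden files (as the PR summary suggests), and that `clear_stale_golden` is a hypothetical helper rather than anything in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove leftover golden_*.bin from earlier runs so that
# a later existence check cannot pick up stale references.
clear_stale_golden() {
  local dir="$1"
  # Assumption: CUSTOM_GOLDEN=1 means the golden files are sample-provided
  # and must be preserved.
  if [ "${CUSTOM_GOLDEN:-0}" = "1" ]; then
    return 0
  fi
  # With no match the pattern stays literal; rm -f ignores the missing file.
  rm -f -- "${dir}"/golden_*.bin
}
```

Called before golden.py, this makes the outcome independent of what earlier runs left behind in the directory.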

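  3. P3 GitHub CI still does not pass `GOLDEN_MODE` to the remote validation script .github/workflows/ci.yml:394

The PR adds a GOLDEN_MODE=npu_precision branch to the remote script, but the SSH invocation in the workflow only forwards STAGE, RUN_MODE, SOC_VERSION, PTO_ISA_REPO, PTO_ISA_COMMIT, DEVICE_ID, SKIP_CASES, and RUN_ONLY_CASES. GOLDEN_MODE is not carried into the remote environment, so GitHub Actions still only takes the default npu path, and this fix does not take effect on the main CI path.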
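
Forwarding the variable could look roughly like the sketch below, which builds an environment prefix for the remote command from whichever variables are set locally. It does not reproduce the actual workflow step: `build_remote_env` and `REMOTE_HOST` are hypothetical names, and only the variable list comes from the finding above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit "NAME=value" pairs, shell-quoted, for every
# forwarded variable that is set and non-empty in the local environment.
build_remote_env() {
  local name out=""
  for name in STAGE RUN_MODE SOC_VERSION PTO_ISA_REPO PTO_ISA_COMMIT \
              DEVICE_ID SKIP_CASES RUN_ONLY_CASES GOLDEN_MODE; do
    # ${!name-} is bash indirect expansion with an empty default.
    if [ -n "${!name-}" ]; then
      out+="${name}=$(printf '%q' "${!name}") "
    fi
  done
  printf '%s' "${out% }"
}

# Example invocation (REMOTE_HOST is an assumed placeholder):
#   ssh "${REMOTE_HOST}" "$(build_remote_env) bash run_remote_npu_validation.sh"
```

Keeping the forwarded names in one list makes it harder to add a new knob to the remote script while forgetting the CI side.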

@Zhendong404 Zhendong404 changed the title from "Fix redundant reruns in npu validation scripts" to "[WIP] Fix redundant reruns in npu validation scripts" Apr 1, 2026

Labels

None yet

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

npu_validation run.sh template redundantly reruns golden.py and NPU executable in GOLDEN_MODE=npu

3 participants