[Bug] Intermittent precision failure in paged_attention test #359
Open
Labels
bug: Something isn't working
Description
Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
The paged_attention device test (tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention) exhibits intermittent precision verification failures. The test passes most runs but occasionally produces output that does not match golden values.
Steps to Reproduce
1. Use the batch test script `batch_pa_test.sh` to run the paged_attention test repeatedly (100 iterations):
bash batch_pa_test.sh
The script runs the test in a loop and stops early on the first precision failure.
2. Alternatively, run the test manually in a loop:
for i in $(seq 1 100); do
  python examples/scripts/run_example.py \
    -k tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/kernels \
    -g tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/golden.py \
    -p a2a3
done
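The batch script is described as stopping on the first precision failure; the same driver loop can be sketched in Python. `CMD`, `run_until_failure`, and `run_test` are illustrative names for this issue, not part of the repository:

```python
import subprocess

# Command under test, taken from the repro steps above.
CMD = [
    "python", "examples/scripts/run_example.py",
    "-k", "tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/kernels",
    "-g", "tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/golden.py",
    "-p", "a2a3",
]

def run_until_failure(run_once, max_runs=100):
    """Run `run_once` up to `max_runs` times; return the 1-based index
    of the first failing run, or None if every run passes."""
    for i in range(1, max_runs + 1):
        if not run_once():
            return i
    return None

def run_test():
    # Exit code 0 means the run passed precision verification.
    return subprocess.run(CMD).returncode == 0

# On the target hardware: first_fail = run_until_failure(run_test)
```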
The failure typically occurs within ~85 runs but can happen at any point.
Expected Behavior
All 100 runs should pass precision verification (100 PASSED, 0 PRECISION_FAILED).
Actual Behavior
The test fails intermittently with a precision mismatch:
[INFO] Comparing out: shape=torch.Size([256, 16, 128]), dtype=torch.float32
[ERROR] TEST FAILED: Output 'out' does not match golden.
Mismatched elements: 478/524288
rtol=0.001, atol=0.001
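For context, a mismatch count like 478/524288 is the number of elements that fall outside the combined rtol/atol tolerance band. A minimal NumPy sketch of such a check, where `count_mismatches` is an illustrative stand-in for the harness's golden comparison (the actual harness uses torch tensors):

```python
import numpy as np

def count_mismatches(out, golden, rtol=1e-3, atol=1e-3):
    # An element matches when |out - golden| <= atol + rtol * |golden|,
    # the criterion applied by numpy.isclose / torch.isclose.
    close = np.isclose(out, golden, rtol=rtol, atol=atol)
    return int((~close).sum()), close.size

# Shapes from the log: [256, 16, 128] float32 -> 524288 elements.
golden = np.zeros((256, 16, 128), dtype=np.float32)
out = golden.copy()
out.flat[:478] += 0.01  # perturb 478 elements beyond tolerance
bad, total = count_mismatches(out, golden)
print(f"Mismatched elements: {bad}/{total}")  # Mismatched elements: 478/524288
```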
Git Commit ID
No response
CANN Version
No response
Driver Version
No response
Host Platform
Linux (x86_64)
Additional Context
No response
Metadata
Assignees
Labels
bug: Something isn't working
Type
Projects
Status
In Progress