[Bug] Intermittent precision failure in paged_attention test #359

@chenshengxin2026

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

The paged_attention device test (tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention) fails precision verification intermittently. It passes on most runs but occasionally produces output that does not match the golden values.

Steps to Reproduce

1. Use the batch test script `batch_pa_test.sh` to run the paged_attention test repeatedly (100 iterations):


bash batch_pa_test.sh


The script runs the test in a loop and stops early on the first precision failure.

2. Alternatively, run the test manually in a loop:


for i in $(seq 1 100); do
  python examples/scripts/run_example.py \
    -k tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/kernels \
    -g tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention/golden.py \
    -p a2a3
done


The first failure typically appears within about 85 runs, but it can occur on any iteration.
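For anyone who wants pass/fail counts without `batch_pa_test.sh`, the manual loop above can be sketched in Python. This is only an illustration, not the contents of the batch script: the helper names `run_repeatedly` and `paged_attention_cmd` are hypothetical, and the sketch assumes `run_example.py` exits with a nonzero status when verification fails, which this issue does not confirm.

```python
import subprocess
import sys

# Test directory taken from the reproduction steps in this issue.
BASE = "tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention"


def paged_attention_cmd():
    """Build the same command line as the manual shell loop (hypothetical helper)."""
    return [
        sys.executable, "examples/scripts/run_example.py",
        "-k", f"{BASE}/kernels",
        "-g", f"{BASE}/golden.py",
        "-p", "a2a3",
    ]


def run_repeatedly(cmd, iterations=100, stop_on_failure=True):
    """Run cmd up to `iterations` times and tally the outcomes.

    Assumption: a nonzero exit code means the run failed verification.
    """
    passed = 0
    failed = 0
    first_failure = None
    for i in range(1, iterations + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            passed += 1
            continue
        failed += 1
        if first_failure is None:
            first_failure = i  # 1-based iteration of the first failing run
        if stop_on_failure:
            break
    return {"passed": passed, "failed": failed, "first_failure": first_failure}
```

Calling `run_repeatedly(paged_attention_cmd())` from the repository root returns a small summary dictionary. Passing `stop_on_failure=False` keeps going after a failure, which may help estimate the failure rate rather than only the first failing iteration.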

Expected Behavior

All 100 runs should pass precision verification (100 PASSED, 0 PRECISION_FAILED).

Actual Behavior

The test fails intermittently with a precision mismatch:

[INFO] Comparing out: shape=torch.Size([256, 16, 128]), dtype=torch.float32
[ERROR] TEST FAILED: Output 'out' does not match golden.
Mismatched elements: 478/524288
rtol=0.001, atol=0.001
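For scale, the tensor has 256 × 16 × 128 = 524288 elements, so 478 mismatches is roughly 0.09% of the output. The sketch below shows how such a count could arise, assuming the harness uses the common element-wise criterion |out − golden| ≤ atol + rtol · |golden| (the one used by `torch.allclose`). The issue does not state which comparison the test actually applies, and `count_mismatches` is a hypothetical helper operating on flat lists of floats.

```python
import math


def count_mismatches(out, golden, rtol=1e-3, atol=1e-3):
    """Count elements violating |out - golden| <= atol + rtol * |golden|.

    Assumption: this mirrors a torch.allclose-style check; NaN on either
    side is treated as a mismatch.
    """
    if len(out) != len(golden):
        raise ValueError("out and golden must have the same length")
    mismatched = 0
    for o, g in zip(out, golden):
        if math.isnan(o) or math.isnan(g):
            mismatched += 1
        elif abs(o - g) > atol + rtol * abs(g):
            mismatched += 1
    return mismatched, len(golden)


# Fraction of mismatched elements reported in the log above.
reported_fraction = 478 / (256 * 16 * 128)
```

Under this criterion, an element near zero is effectively held to the absolute tolerance alone, so small-magnitude outputs are the most likely to trip the check.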

Git Commit ID

2757be6

CANN Version

No response

Driver Version

No response

Host Platform

Linux (x86_64)

Additional Context

No response

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests