中文版 / Chinese: README.zh-CN.md
For each CANN 9.1.0-beta.1 API, write a minimal runnable example, run it under cannsim card-free (no-NPU) simulation (CAModel), and record a performance report.
System requirements: Linux aarch64 (arm64v8+) or x86_64. install.sh auto-detects the host CPU via uname -m and downloads the matching CANN .run package + miniforge installer (override with ARCH=aarch64|x86_64). Auto-installing the system libraries (libnuma1/libgomp1/libssl-dev) assumes a Debian/Ubuntu-family (apt) distro; on non-apt distros, install those libraries manually.
Results are tagged by host arch so the two platforms never clobber each other: aarch64 keeps the original unsuffixed names (RESULT.md, reports/INDEX.md); other arches get a suffix (e.g. RESULT.x86_64.md, reports/INDEX.x86_64.md). The deterministic CAModel metrics (instruction count, execution-time ns) are host-independent — they should match across arches for the same kernel — while the host CPU only affects the simulator's own wall-clock speed.
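The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described, not the code the harness uses; the helper name tagged_name is hypothetical.

```python
import platform
from typing import Optional


def tagged_name(filename: str, arch: Optional[str] = None) -> str:
    """Return the arch-tagged variant of a result/report filename.

    Hypothetical helper: aarch64 keeps the unsuffixed name, any other
    arch gets ".<arch>" inserted before the file extension.
    """
    arch = arch or platform.machine()  # same value that `uname -m` reports on Linux
    if arch == "aarch64":
        return filename
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: just append the arch
        return f"{filename}.{arch}"
    return f"{stem}.{arch}.{ext}"


print(tagged_name("RESULT.md", "aarch64"))
print(tagged_name("reports/INDEX.md", "x86_64"))
```

With these inputs the first call leaves RESULT.md unchanged and the second yields reports/INDEX.x86_64.md, matching the examples in the paragraph above.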
The environment is built by this directory's install.sh, which installs the CANN Toolkit + the cannsim CLI into an isolated conda env cannsim (toolkit only, no NPU driver — pure card-free simulation):
- ./install.sh: full install, i.e. download the CANN .run → miniforge3 → create the conda env → Toolkit + cannsim wheel → smoke test.
- ./install.sh test: smoke test only, i.e. cannsim CLI / camodel checks + real-operator simulation of the vector/add unit (compiles an AscendC kernel and runs it on CAModel through the harness).
- ./install.sh test-all: test plus a full re-run of all units (run_all.sh).
- ./install.sh uninstall: remove the conda env (keeps miniforge and the downloaded .run).
Env knobs: ARCH (default uname -m), ENV_NAME (default cannsim), PY_VER, MINIFORGE_DIR, SKIP_REAL_SIM=1 (skip real-op sim), RUN_ALL=1 (run all units in test). The real-operator simulation reuses harness/run_one.sh examples/ascendc/vector/add, which is self-contained (sources conda + set_env, builds, records, verifies).
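As a rough illustration of how these knobs resolve, the sketch below mirrors the stated defaults in Python. install.sh itself is a shell script, so this is not its implementation; resolve_knobs is a hypothetical name, and the defaults of PY_VER and MINIFORGE_DIR are left unset because the text does not state them.

```python
import os
import platform


def resolve_knobs(env=None):
    """Resolve the install/test knobs from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "ARCH": env.get("ARCH") or platform.machine(),  # default: uname -m
        "ENV_NAME": env.get("ENV_NAME") or "cannsim",   # default conda env name
        "PY_VER": env.get("PY_VER"),                    # default not stated here
        "MINIFORGE_DIR": env.get("MINIFORGE_DIR"),      # default not stated here
        "SKIP_REAL_SIM": env.get("SKIP_REAL_SIM") == "1",  # skip real-op sim
        "RUN_ALL": env.get("RUN_ALL") == "1",              # run all units in test
    }


knobs = resolve_knobs({"ARCH": "x86_64", "RUN_ALL": "1"})
print(knobs["ARCH"], knobs["ENV_NAME"], knobs["RUN_ALL"])
```

On the command line the equivalent would be something like ARCH=x86_64 RUN_ALL=1 ./install.sh test.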
- Aggregate table (instr count / time / report links): reports/INDEX.md (aarch64) · reports/INDEX.x86_64.md (x86_64)
- Performance analysis (instr/time comparison + insights): reports/PERF.md
- Cross-arch comparison (aarch64 vs x86_64): reports/PERF.x86_64.md
- API doc index: docs/INDEX.md
- binary (dst,src0,src1,count): Add/Sub/Mul/Div/Max/Min
- scalar (dst,src,scalar,count): Adds/Muls/Subs/Divs/Maxs/Mins
- unary (dst,src,count): Exp/Ln/Abs/Sqrt/Rsqrt/Reciprocal/Relu/Neg
- cast (dst,src,roundMode,count): Cast (float→int32)
- reduce (dst,src,tmp,count): ReduceSum/ReduceMax/ReduceMin
- activation / math (adv_api simple count mode): Sigmoid/Gelu/Silu/Swish + unary Sin/Cos/Tan/Tanh/Sinh/Cosh/Asin/Acos/Atan/Erf/Erfc/Floor/Ceil/Round/Rint/Trunc/Sign/Frac + binary Power/Fmod/Hypot
- high-level + hand-filled tiling: RmsNorm / LayerNorm / GroupNorm / DeepNorm / BatchNorm / Pad (a few tiling fields hand-filled inside the kernel for a fixed shape)
- high-level + device tiling: SoftMax (SoftMaxTilingFunc), LogSoftmax (reuses SoftMaxTilingFunc to fill an isomorphic tiling; measured to use log10), Broadcast (GetBroadcastTilingInfo) (tiling computed inside the kernel)
- multi-core sync: SyncAll (all-core barrier), IBSet/IBWait (cross-core one-to-one flag, chained dependency) (blockDim=8, DataCopy for GM communication)
- high-level + host tiling: TopK (host TopKTilingFunc computes → passed in via GM)
- high-level tiling-free: Sort (count mode, ascending), Transpose (vtranspose b16, 16×16)
- cube + host tiling: Matmul (host matmul_tiling computes TCubeTiling → passed in via GM → Cube)
- manual (hand-written): Select (+CompareScalar), DataCopy
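To make the first three arity families concrete, here is a host-side reference sketch in plain Python. It is not Ascend C code and not the harness's verifier. It assumes that count mode processes the first count elements of each input; the function names (binary_ref, scalar_ref, unary_ref) are hypothetical, and only a handful of the listed operators are included.

```python
import math
import operator

# Reference semantics for a few of the listed operators (float inputs).
BINARY = {"Add": operator.add, "Sub": operator.sub, "Mul": operator.mul,
          "Max": max, "Min": min}
UNARY = {"Abs": abs, "Exp": math.exp, "Sqrt": math.sqrt,
         "Neg": operator.neg, "Relu": lambda x: max(x, 0.0)}


def binary_ref(name, src0, src1, count):
    """binary (dst, src0, src1, count): elementwise over the first count items."""
    f = BINARY[name]
    return [f(a, b) for a, b in zip(src0[:count], src1[:count])]


def scalar_ref(name, src, scalar, count):
    """scalar (dst, src, scalar, count): 'Adds' is 'Add' against one scalar."""
    f = BINARY[name[:-1]]  # strip the trailing 's': Adds -> Add, Muls -> Mul
    return [f(a, scalar) for a in src[:count]]


def unary_ref(name, src, count):
    """unary (dst, src, count): elementwise over the first count items."""
    f = UNARY[name]
    return [f(a) for a in src[:count]]


print(binary_ref("Add", [1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 2))
print(scalar_ref("Muls", [1.0, 2.0, 3.0], 2.0, 3))
print(unary_ref("Relu", [-1.0, 2.0], 2))
```

Python floats are double precision, so comparing against simulated float results would need a tolerance rather than exact equality.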
manifest.yaml                 # source of truth: each API's arity/dtype/inputs/expectation/status
harness/
  templates/*.in              # CMakeLists + kernel/host templates per arity
  gen.py                      # manifest -> examples/<lib>/<cat>/<api>/{kernel,main,CMakeLists,meta,doc}
  run_one.sh <dir>            # set_env -> build -> cannsim record -g -> RESULT.md
  run_all.sh [--gen] [..]     # batch build + simulate
  aggregate.py                # aggregate -> reports/INDEX.md
examples/<lib>/<cat>/<api>/   # one unit per API (doc.md/doc.zh-CN.md/kernel.cpp/main.cpp/CMakeLists.txt/meta.json/RESULT.md/report/)
docs/INDEX.md                 # human-readable API index
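The unit layout above lends itself to a small sanity check. The sketch below is an assumption-laden illustration, not part of the harness: unit_dir and missing_sources are hypothetical helpers, and the file list covers only the generated sources (RESULT.md and report/ appear after a simulation run, so they are left out).

```python
from pathlib import Path

# Files gen.py is described as producing for every unit.
UNIT_FILES = ("doc.md", "doc.zh-CN.md", "kernel.cpp", "main.cpp",
              "CMakeLists.txt", "meta.json")


def unit_dir(root, lib, cat, api):
    """Build the examples/<lib>/<cat>/<api> path for one unit."""
    return Path(root) / "examples" / lib / cat / api


def missing_sources(unit):
    """List the expected generated files that are absent from a unit directory."""
    return [name for name in UNIT_FILES if not (Path(unit) / name).is_file()]


unit = unit_dir(".", "ascendc", "vector", "add")
print(unit.as_posix())
print(missing_sources(unit))
```

Run from the repository root, an empty second line of output would mean the vector/add unit has all of its generated sources.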
Every example unit is uniform: float dtype, 8 cores, double buffering, build SOC Ascend950PR_9599, simulation SOC Ascend950.
The function prototypes in doc.md are extracted from the toolkit headers by gen.py (authoritative, offline) rather than hand-guessed.
- Add an entry to manifest.yaml (pick the arity, fill inputs/expect; for a new signature, check the toolkit header first).
- python3 harness/gen.py <Name> to generate the unit.
- bash harness/run_one.sh examples/.../<name> to build and simulate.
- python3 harness/aggregate.py to refresh the aggregate table.
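The three commands that follow the manifest edit can be chained from a small driver. The sketch below only builds and prints the command lines; add_unit_commands and render are hypothetical helpers, and the API name "Add" paired with examples/ascendc/vector/add is an assumed example rather than a documented mapping.

```python
import shlex


def add_unit_commands(name, unit_dir):
    """Return the generate -> build/simulate -> aggregate commands as argv lists."""
    return [
        ["python3", "harness/gen.py", name],       # generate the unit
        ["bash", "harness/run_one.sh", unit_dir],  # build and simulate it
        ["python3", "harness/aggregate.py"],       # refresh the aggregate table
    ]


def render(commands):
    """Render argv lists as shell-quoted lines, one command per line."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


print(render(add_unit_commands("Add", "examples/ascendc/vector/add")))
```

To actually execute the steps, each argv list could be passed to subprocess.run(cmd, check=True) from the repository root, so that a failed build stops the chain before the aggregate table is refreshed.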
A new arity (different signature) needs a template under harness/templates/ and registration in gen.py's ARITY_MAP.
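One way to picture such a registration is a dictionary keyed by arity. The real shape of ARITY_MAP in gen.py is not shown in this document, so everything below is a hypothetical sketch: the entry layout, the template filenames, the register_arity helper, and the "ternary" example are all assumptions. Only the argument lists are taken from the arity families listed earlier.

```python
# Hypothetical layout: arity name -> argument list + template file under
# harness/templates/. Template filenames here are invented for illustration.
ARITY_MAP = {
    "binary": {"args": ("dst", "src0", "src1", "count"), "template": "binary.in"},
    "scalar": {"args": ("dst", "src", "scalar", "count"), "template": "scalar.in"},
    "unary": {"args": ("dst", "src", "count"), "template": "unary.in"},
}


def register_arity(arity, args, template):
    """Register a new signature, refusing to overwrite an existing arity."""
    if arity in ARITY_MAP:
        raise ValueError(f"arity {arity!r} is already registered")
    ARITY_MAP[arity] = {"args": tuple(args), "template": template}


# Made-up example of a new signature with three source operands.
register_arity("ternary", ("dst", "src0", "src1", "src2", "count"), "ternary.in")
print(sorted(ARITY_MAP))
```

Rejecting duplicates keeps an existing template from being silently replaced when a new signature is added under a name that is already taken.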
- Track A (simulatable Ascend C kernel API) = core ✅ done (172/172): 9 arity families + 4 tiling techniques (device TilingFunc / few-field hand-fill / host-tiling framework / tiling-free) + Cube (Matmul) + multi-core sync (SyncAll/IBSet/IBWait) + data-rearrange/atomic/scalar. The following adv_api computation sub-libraries are covered — beyond the mainstream set: fused binary (AddRelu/SubRelu/FusedMulAdd/MulAddDst/MulAddRelu/AbsSub/ExpSub/Mull), shifts (ShiftLeft/ShiftRight), scalar variants (Ands/Ors/LeakyRelu/MulsCast), Prelu, inverse-hyperbolic (Acosh/Asinh/Atanh), Digamma, ClampMax/Min, Fma, SinCos, GeGLU, Xor, logical (LogicalAnd/Or/Xor/Not — bool out), predicate (IsNan/IsInf/IsFinite — bool out), Where, CumSum; gated activations (SwiGLU/ReGLU), adv_api reductions (Mean/Sum, last-axis), index generation (ArithProgression), quantization (AscendQuant/AscendDequant/AscendAntiQuant), and normalization extras (Normalize/WelfordFinalize/LayerNormGrad/LayerNormGradBeta/DropOut). A further batch of adv_api kernel primitives was filled in after cross-checking against the cann-on-gpu operator library: softmax backward / flash-attention online softmax (SoftmaxGrad/SoftmaxFlashV2), the adv_api reduce family Any/All/Prod/XorSum (ReduceAny/ReduceAll/ReduceProd/ReduceXorSum, distinct from the older vector ReduceSum/Max/Min), byte-mask select (SelectWithBytesMask), format conversion (ConfusionTranspose/TransData), and tensor-scalar logical + bitwise count-mode variants (LogicalAnds/LogicalOrs/BitwiseAnd/BitwiseOr/BitwiseNot). 
A later pass over the local AscendC kernel headers added the remaining primitives: reduce-granularity variants (BlockReduceSum/Max/Min per-32B-block, WholeReduceSum/Max/Min per-256B-repeat, PairReduceSum adjacent-pair, RepeatReduceSum), data-rearrange (Interleave/DeInterleave, block gather Gatherb), vector-scalar compare-to-mask (Compares, the canonical replacement for CompareScalar), truncate-round (Truncate, keeps dtype, distinct from math Trunc), log2/log10 (Log2/Log10), faster-gelu (FasterGelu/FasterGeluV2), simple/flashV3/grad-front softmax (SimpleSoftMax/SoftmaxFlashV3/SoftmaxGradFront), unpad (UnPad, inverse of Pad), welford-update (WelfordUpdate online step), multi-policy Quantize (QuantizeConfig, distinct from AscendQuant), and relu-cast/dequant variants (AddReluCast/SubReluCast/AddDeqRelu/CastDequant). Five deprecated aliases (FusedMulsCast/FusedAbsSub/FusedExpSub/FusedMulAddRelu/CastDeq) are already covered by their canonical names and are not counted. Remaining uncovered (Philox RNG, arch-limited Conv3D/Conv2D) are itemized in
docs/INDEX.md.
- Track B (Runtime/ACL host API) = ✅ done: the host-side scaffolding that launches kernels, systematically documented in docs/runtime/host_api.md.
- Track C (GE/HCCL/HIXL/DVPP/ATB/SiP) = documented as not covered: per-library "why not covered + future conditions" in docs/notcovered.md (needs a real driver / multiple cards / dedicated hardware).
- Pending real hardware: Conv3D/Conv2D (the toolkit only ships the m220 implementation; needs a 910B-series environment to validate).