[NE16] Add GAP9_w_NE16 platform (NE16 accelerator on GAP9)#1

Open
runwangdl wants to merge 15 commits into devel from gap9-ne16

Conversation

@runwangdl
Owner

@runwangdl runwangdl commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer. _weightEncode ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py + DeeployTest/test_gap9_ne16_tiled_config.py — runner + kernel test config.
  • DeeployTest/test_platforms.py — pytest functions test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer under marker gap9_w_ne16_tiled.
  • .github/workflows/{ci-platform-gap9-w-ne16-tiled.yml,_runner-gap9-w-ne16-tiled.yml} — CI jobs (single + double buffer L2).
  • TargetLibraries/GAP9/CMakeLists.txt — add_subdirectory(pulp-nnx) with USE_NE16=ON for GAP9_w_NE16.
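
For context, the bit-serial layout that the `_weightEncode` port implements can be sketched in a few lines of numpy. This is a simplified illustration of the single CIN_SUBTILE=16 mode described above, not the ported code itself; the function name, loop nesting, and padding details are assumptions:

```python
import numpy as np

CIN_SUBTILE = 16  # NE16 input-channel subtile width (single mode, per this PR)

def weight_encode_sketch(weight: np.ndarray, qw: int) -> np.ndarray:
    """Bit-serial re-layout of conv weights for an NE16-style engine.

    Simplified sketch: for every output channel, every weight bit-plane,
    and every spatial position, 16 input-channel bits are packed into
    2 bytes. Illustrative only; see pulp-nnx/test/Ne16Weight.py for the
    real ordering.
    """
    cout, cin, h, w = weight.shape
    assert cin <= CIN_SUBTILE, "sketch handles a single input-channel subtile"
    # Pad input channels up to the subtile width.
    padded = np.zeros((cout, CIN_SUBTILE, h, w), dtype=np.uint8)
    padded[:, :cin] = weight.astype(np.uint8)
    out = np.zeros((cout, qw, h * w, CIN_SUBTILE // 8), dtype=np.uint8)
    for co in range(cout):
        for bit in range(qw):                  # one bit-plane at a time
            plane = (padded[co] >> bit) & 1    # (16, h, w) array of 0/1
            for s in range(h * w):
                bits = plane.reshape(CIN_SUBTILE, -1)[:, s]
                # Pack 16 channel bits little-endian into 2 bytes.
                out[co, bit, s, 0] = np.packbits(bits[:8], bitorder="little")[0]
                out[co, bit, s, 1] = np.packbits(bits[8:], bitorder="little")[0]
    return out.reshape(cout, -1)               # flat byte stream per out channel
```

With qw = 8 this gives 2 bytes per spatial position per bit-plane, i.e. 144 bytes per output channel for a 3x3 kernel and 16 for a 1x1, which is consistent with the weight strides quoted under Changed.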

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in names/mapPlatform/setupMemoryPlatform/mapDeployer.
  • DeeployTest/testMVP.py — wrap deployer with EngineColoringDeployerWrapper for GAP9_w_NE16 (without it NE16 nodes never get an engine color and parsing fails).
  • DeeployTest/testUtils/core/execution.py — append the GAP9 SDK image build target for GAP9_w_NE16 (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — NE16 subtile constants per ne16_task_defs.h: CIN_SUBTILE 16, output 3, weight stride d0 = 3*3*weight_d0_stride_mode8 = 18 for DW/Dense (PW qw * weight_d0_stride = 16). Emit top-level ne16_task_t fields (weight_d0_stride, qw, subtile_output_channel, kernel_shape, depthwise) that the HW reads at dispatch time.
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py — DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so _weightEncode sees the standard (cout, 1, H, W) layout and produces the correct (1, 1, packed_bytes) single-block output expected by the NE16 HW.
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py — DW weight is a single packed block (not per-cout); constrain weightOutChannelVar == Max and reuse the same HyperRectangle((0,0,0), weightShape) for every output-channel tile.
  • Deeploy/Targets/NE16/Parsers.py — drop the group == shape[1] check in NE16DWConv2DParser (invalid under the post-encode rank-3 layout).
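
The stride arithmetic in the ConvTemplate item above can be re-derived directly. A hedged sketch (constant names echo ne16_task_defs.h as quoted in the PR; the helper function itself is illustrative, not Deeploy API):

```python
# NE16 weight-stride arithmetic from the ConvTemplate bullet above.
# weight_d0_stride_mode8 is the byte width of one 16-channel bit-row
# (16 bits = 2 bytes in 8-bit mode). Values are just the arithmetic
# re-derived; the helper name is made up for this sketch.
CIN_SUBTILE = 16
WEIGHT_D0_STRIDE_MODE8 = CIN_SUBTILE // 8  # 2 bytes per 16-channel bit-row

def weight_d0_stride(kernel_shape: int, qw: int, spatial_major: bool) -> int:
    if spatial_major:
        # DW/Dense 3x3: the d0 stride spans the spatial window.
        return kernel_shape * kernel_shape * WEIGHT_D0_STRIDE_MODE8
    # PW 1x1: the d0 stride spans the qw bit-planes.
    return qw * WEIGHT_D0_STRIDE_MODE8
```

For qw = 8 this reproduces the constants in the bullet: 3·3·2 = 18 for DW/Dense and 8·2 = 16 for PW.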

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).

Test plan

Run on gvsoc gap9.evk inside ghcr.io/pulp-platform/deeploy-gap9:devel. All verified dispatches (ne16_nnx_dispatch appears in generated Network.c for NE16-routed nodes):

| Test | L1 | Buffer | Errors | Runtime (cycles) |
| --- | --- | --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | single | 0 / 1152 | ~900k |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D | 32000 | single | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | single | 0 / 1280 | ~27k |
| Kernels/Integer/Conv/DW_2D_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/StriddedPadded_2D_RQ | 32000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | double | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | double | 0 | |

Follow-up (out of scope):

  • PW_2D_RQ/Unsigned_RQ uses int8 input. Ne16TestConf.py only supports uint8 and NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign-propagation (shift int8 → uint8 + adjust weight_offset).
  • 3x3 dense-conv kernel tests don't exist in Tests/Kernels/Integer/Conv/ today (Regular_2D_RQ is 8×8); coverage is via the model path once the remaining tiling-system edge cases are resolved.
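
The sign-propagation mentioned in the first follow-up item rests on a simple identity: shifting int8 activations by +128 into uint8 range is exact as long as 128 · Σw per output is folded into the bias/weight offset. A plain-numpy sketch of that identity (illustrative only, not NE16 runtime code):

```python
import numpy as np

def signed_via_unsigned_matmul(x_i8: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Run a signed-input product on a uint8-only engine.

    Identity behind the follow-up item above:
      (x + 128) @ w - 128 * sum_k(w[k, :]) == x @ w
    so a uint8-only accelerator can process int8 inputs if the bias
    absorbs the correction term. Sketch, not the proposed implementation.
    """
    x_u8 = x_i8.astype(np.int32) + 128              # now in [0, 255]
    assert x_u8.min() >= 0 and x_u8.max() <= 255
    correction = 128 * w.astype(np.int32).sum(axis=0)  # folded into the bias
    return x_u8 @ w.astype(np.int32) - correction
```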

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

marchioa and others added 9 commits February 16, 2026 19:07
Add integer MaxPool1D for Generic platform and RQSConv1D support for PULPOpen, with corresponding kernel tests.

## Added
- MaxPool1D fp32 kernel and template for Generic platform
- Test for MaxPool1D fp32 and int8
- RQSConv1D template and tiling constraints for the PULP platform
- Test for RQSConv1D int8

## Changed
- renamed MaxPool2D test from MaxPool to MaxPool/Regular_2D for both Integer and FP32

## Fixed
- im2col buffer size in Conv1d template
This is a very small PR adding @runwangdl as a code owner and mentioning people with more than one merged PR in the CHANGELOG.md. The reasoning is that we already track contributions through `git`, and we only mention authors with significant contributions to the project.

## Changed
- Extended Codeowners
- Mention people with significant contributions
This PR adds comprehensive GAP9 container support with ARM64 compatibility. It uses the latest GAP SDK version (`v5.21.2`).

## Added
- GAP9 Container Support with ARM64 architecture support
- GAP9 Container with GAP9 SDK (`v5.21.1-staging-1`)
- GAP9 Docker GitHub Build Flow (`.github/workflows/docker-build-deeploy-gap9.yml`)
- GAP9 Run script for real hardware (`scripts/gap9-run.sh`)
- Shell Format pre-commit hook
- zsh and oh-my-zsh plugin installation in containers
- New GAP9 README documentation (`README_GAP9.md`)
- GAP9-specific Docker patches for AMD64 and ARM64

## Changed
- Cleaned up Docker flow to use a temporary build folder
- Memory usage is now printed by default on GAP9
- Temporarily disabled GAP9 on forks for CI

## Fixed
- Spelling mistakes in documentation
- Missing version link
This PR adds many missing docstring comments and improves debugging, especially when using a GUI debugger, by providing more helpful `__repr__()` for the `_ReferenceBuffer` class. Additionally, it moves the `MemoryAwareClosureGeneration` and `MemoryAwarePrint*` passes from the `CommonExtensions` to the `MemoryLevelExtension`.

## Added
- Add many missing docstrings
- Add `__repr__()` function for the `_ReferenceBuffer` class

## Changed
- Move `MemoryAwareClosureGeneration` pass to `MemoryLevelExtension`
- Move `MemoryAwarePrint*` passes to `MemoryLevelExtension`
- Make `sizeInBytes` a class property instead of a function
- Move `AnnotateNeurekaWeightMemoryLevel` to `Neureka` specific folder
This PR fixes the currently broken CI, which had two causes:
1. `setuptools 82.0.0` recently removed the `pkg_resources` library, which is used by GVSoC
2. The Docker flow did not check out git lfs files; however, these are required to use the precompiled `udma_v4_gap9_v2_impl.so`

**The GAP9 check will only pass on the fork! (See below for status)**

## Changed
- Use the `devel` container by default for GAP9 CI
- Extend Readme platforms with GAP9 shields

## Fixed
- Fix Docker flow to fetch `*.so` git lfs files
- Downgrade `setuptools` to `81.0.0`
Fix broken CI cache generation by adding missing `shell: bash` directive and correcting a test case reference.

See below for successful cache generation actions:
- GAP9: https://github.com/pulp-platform/Deeploy/actions/runs/23290279441 
- Others: https://github.com/pulp-platform/Deeploy/actions/runs/23290526544

## Changed
- Added `shell: bash` to the "Generate CCache" step in `infra-generate-ccache.yml` to ensure correct shell execution
- Added `shell: bash` to the "Generate CCache for GAP9" step in `infra-generate-ccache-gap9.yml` to ensure correct shell execution

## Fixed
- Fixed wrong test case in GAP9 ccache workflow: replaced `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/Add/Large-5000-L2-singlebuffer]` with `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/MatMul/Regular-64000-L2-singlebuffer]`
…form#177)

* [Deeploy PR] put the tiling information into layer code as well

* [Deeploy PR] Fix the tiling information corruption. Add bracket before and after L3 code for each layer to reduce stack usage

- Previously the tiling information was corrupted after each run, because the generated code wrote the next element of the tiling-information array into the current location. After one RunNetwork call, the last element of the tiling array ended up in the first location, corrupting the tiling information. The fix in tilingVariableReplacement points the reference to the new location instead of assigning a value through the reference.

- The C code generated by Deeploy caused high stack usage, because all variables defined in RunNetwork live on the stack, including all the tiling pointers and call arguments. Wrapping each layer in RunNetwork in its own brace scope makes the call arguments live for only one layer, significantly reducing stack usage. The tiling pointers still live on the stack and need to be moved as well, but that requires more changes.

* Update CHANGELOG.md
…tform#162)

* Deeploy Microbenchmark with GVSoC CSR and Demo on GEMM

* Add microbenchmark to codepass

* Update pro microbenchmark code transformation

* Add helper function for profileMicrobenchmark

* perf-util add pre-commit

* Rebase singlebuffertilingcodegeneration

* Make workspace safe to prevent "dubious ownership" sporadic issues

* Update changelog

* Fix linting

* Add microbenchmark tutorial to docs

* Trim microbenchmark tutorial

---------

Co-authored-by: Run Wang <52746141+SamanthaWangdl@users.noreply.github.com>
Co-authored-by: Victor Jung <jungvi@iis.ee.ethz.ch>
* Add option to deploy on the board for the GAP9 platform

* Add proper D flag for GAP9 board

* Make pane name agnostic of the config

* Fix usbip host resolve for Linux platforms

* Fix hostname resolution for Macos

* Live print of the simulator cmd

* Revert gap9 docker link

* Add optional GPIO toggling for power measurements for GAP9

* Format

* Cleanup file handles to avoid unhandled exception in pytest

* format

* Remove unused GPIO and update gitignore

* Align gap9-run.sh mount point with README convention

Mount the host working directory to /app/Deeploy inside the container
(matching README.md / README_GAP9.md) instead of /app/work.

* Document 'board' as a valid -s simulator choice

* Clarify -h text for board simulator and powerMeasurement

* README_GAP9: document -s board and --powerMeasurement

---------

Co-authored-by: Run Wang <samanthawangdl@gmail.com>
@runwangdl force-pushed the gap9-ne16 branch 12 times, most recently from 4edb011 to 748707a on April 14, 2026 at 08:54
runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
@runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 on April 14, 2026 at 10:50
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)