[NE16] Add GAP9_w_NE16 platform (NE16 accelerator on GAP9)#1

Open
runwangdl wants to merge 15 commits into devel from gap9-ne16

Conversation

@runwangdl
Owner

@runwangdl runwangdl commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer. _weightEncode ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py + DeeployTest/test_gap9_ne16_tiled_config.py — runner + kernel test config.
  • DeeployTest/test_platforms.py — pytest functions test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer under marker gap9_w_ne16_tiled.
  • .github/workflows/{ci-platform-gap9-w-ne16-tiled.yml,_runner-gap9-w-ne16-tiled.yml} — CI jobs (single + double buffer L2).
  • TargetLibraries/GAP9/CMakeLists.txt — add_subdirectory(pulp-nnx) with USE_NE16=ON for GAP9_w_NE16.
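
For context, the bit-serial layout that the `_weightEncode` port implements can be sketched in a few lines of numpy. This is a simplified illustration of the single CIN_SUBTILE=16 mode described above, not the ported code itself; the function name, loop nesting, and padding details are assumptions:

```python
import numpy as np

CIN_SUBTILE = 16  # NE16 input-channel subtile width (single mode, per this PR)

def weight_encode_sketch(weight: np.ndarray, qw: int) -> np.ndarray:
    """Bit-serial re-layout of conv weights for an NE16-style engine.

    Simplified sketch: for every output channel, every weight bit-plane,
    and every spatial position, 16 input-channel bits are packed into
    2 bytes. Illustrative only; see pulp-nnx/test/Ne16Weight.py for the
    real ordering.
    """
    cout, cin, h, w = weight.shape
    assert cin <= CIN_SUBTILE, "sketch handles a single input-channel subtile"
    # Pad input channels up to the subtile width.
    padded = np.zeros((cout, CIN_SUBTILE, h, w), dtype=np.uint8)
    padded[:, :cin] = weight.astype(np.uint8)
    out = np.zeros((cout, qw, h * w, CIN_SUBTILE // 8), dtype=np.uint8)
    for co in range(cout):
        for bit in range(qw):                  # one bit-plane at a time
            plane = (padded[co] >> bit) & 1    # (16, h, w) array of 0/1
            for s in range(h * w):
                bits = plane.reshape(CIN_SUBTILE, -1)[:, s]
                # Pack 16 channel bits little-endian into 2 bytes.
                out[co, bit, s, 0] = np.packbits(bits[:8], bitorder="little")[0]
                out[co, bit, s, 1] = np.packbits(bits[8:], bitorder="little")[0]
    return out.reshape(cout, -1)               # flat byte stream per out channel
```

With qw = 8 this gives 2 bytes per spatial position per bit-plane, i.e. 144 bytes per output channel for a 3x3 kernel and 16 for a 1x1, which is consistent with the weight strides quoted under Changed.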

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in names/mapPlatform/setupMemoryPlatform/mapDeployer.
  • DeeployTest/testMVP.py — wrap deployer with EngineColoringDeployerWrapper for GAP9_w_NE16 (without it NE16 nodes never get an engine color and parsing fails).
  • DeeployTest/testUtils/core/execution.py — append the GAP9 SDK image build target for GAP9_w_NE16 (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — NE16 subtile constants per ne16_task_defs.h: CIN_SUBTILE 16, output 3, weight stride d0 = 3*3*weight_d0_stride_mode8 = 18 for DW/Dense (PW qw * weight_d0_stride = 16). Emit top-level ne16_task_t fields (weight_d0_stride, qw, subtile_output_channel, kernel_shape, depthwise) that the HW reads at dispatch time.
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py — DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so _weightEncode sees the standard (cout, 1, H, W) layout and produces the correct (1, 1, packed_bytes) single-block output expected by the NE16 HW.
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py — DW weight is a single packed block (not per-cout); constrain weightOutChannelVar == Max and reuse the same HyperRectangle((0,0,0), weightShape) for every output-channel tile.
  • Deeploy/Targets/NE16/Parsers.py — drop the group == shape[1] check in NE16DWConv2DParser (invalid under the post-encode rank-3 layout).
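
The stride arithmetic in the ConvTemplate item above can be re-derived directly. A hedged sketch (constant names echo ne16_task_defs.h as quoted in the PR; the helper function itself is illustrative, not Deeploy API):

```python
# NE16 weight-stride arithmetic from the ConvTemplate bullet above.
# weight_d0_stride_mode8 is the byte width of one 16-channel bit-row
# (16 bits = 2 bytes in 8-bit mode). Values are just the arithmetic
# re-derived; the helper name is made up for this sketch.
CIN_SUBTILE = 16
WEIGHT_D0_STRIDE_MODE8 = CIN_SUBTILE // 8  # 2 bytes per 16-channel bit-row

def weight_d0_stride(kernel_shape: int, qw: int, spatial_major: bool) -> int:
    if spatial_major:
        # DW/Dense 3x3: the d0 stride spans the spatial window.
        return kernel_shape * kernel_shape * WEIGHT_D0_STRIDE_MODE8
    # PW 1x1: the d0 stride spans the qw bit-planes.
    return qw * WEIGHT_D0_STRIDE_MODE8
```

For qw = 8 this reproduces the constants in the bullet: 3·3·2 = 18 for DW/Dense and 8·2 = 16 for PW.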

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).

Test plan

Run on gvsoc gap9.evk inside ghcr.io/pulp-platform/deeploy-gap9:devel. All verified dispatches (ne16_nnx_dispatch appears in generated Network.c for NE16-routed nodes):

| Test | L1 | Buffer | Errors | Runtime (cycles) |
| --- | --- | --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | single | 0 / 1152 | ~900k |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D | 32000 | single | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | single | 0 / 1280 | ~27k |
| Kernels/Integer/Conv/DW_2D_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/StriddedPadded_2D_RQ | 32000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | double | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | double | 0 | |

Follow-up (out of scope):

  • PW_2D_RQ/Unsigned_RQ uses int8 input. Ne16TestConf.py only supports uint8 and NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign-propagation (shift int8 → uint8 + adjust weight_offset).
  • 3x3 dense-conv kernel tests don't exist in Tests/Kernels/Integer/Conv/ today (Regular_2D_RQ is 8×8); coverage is via the model path once the remaining tiling-system edge cases are resolved.
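
The sign-propagation mentioned in the first follow-up item rests on a simple identity: shifting int8 activations by +128 into uint8 range is exact as long as 128 · Σw per output is folded into the bias/weight offset. A plain-numpy sketch of that identity (illustrative only, not NE16 runtime code):

```python
import numpy as np

def signed_via_unsigned_matmul(x_i8: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Run a signed-input product on a uint8-only engine.

    Identity behind the follow-up item above:
      (x + 128) @ w - 128 * sum_k(w[k, :]) == x @ w
    so a uint8-only accelerator can process int8 inputs if the bias
    absorbs the correction term. Sketch, not the proposed implementation.
    """
    x_u8 = x_i8.astype(np.int32) + 128              # now in [0, 255]
    assert x_u8.min() >= 0 and x_u8.max() <= 255
    correction = 128 * w.astype(np.int32).sum(axis=0)  # folded into the bias
    return x_u8 @ w.astype(np.int32) - correction
```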

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

marchioa and others added 9 commits February 16, 2026 19:07
Add integer MaxPool1D for Generic platform and RQSConv1D support for PULPOpen, with corresponding kernel tests.

## Added
- MaxPool1D fp32 kernel and template for Generic platform
- Test for MaxPool1D fp32 and int8
- RQSConv1D template and tiling constraints for the PULP platform
- Test for RQSConv1D int8

## Changed
- renamed MaxPool2D test from MaxPool to MaxPool/Regular_2D for both Integer and FP32

## Fixed
- im2col buffer size in Conv1d template
This is a very small PR adding @runwangdl as a code owner and mentioning people with more than one merged PR in the CHANGELOG.md. The reasoning is that we already track contributions through `git`, and we only mention authors with significant contributions to the project.

## Changed
- Extended Codeowners
- Mention people with significant contributions
This PR adds comprehensive GAP9 container support with ARM64 compatibility. It uses the latest GAP SDK version (`v5.21.2`).

## Added
- GAP9 Container Support with ARM64 architecture support
- GAP9 Container with GAP9 SDK (`v5.21.1-staging-1`)
- GAP9 Docker GitHub Build Flow (`.github/workflows/docker-build-deeploy-gap9.yml`)
- GAP9 Run script for real hardware (`scripts/gap9-run.sh`)
- Shell Format pre-commit hook
- zsh and oh-my-zsh plugin installation in containers
- New GAP9 README documentation (`README_GAP9.md`)
- GAP9-specific Docker patches for AMD64 and ARM64

## Changed
- Cleaned up Docker flow to use a temporary build folder
- Memory usage is now printed by default on GAP9
- Temporarily disabled GAP9 on forks for CI

## Fixed
- Spelling mistakes in documentation
- Missing version link
This PR adds many missing docstring comments and improves debugging, especially when using a GUI debugger, by providing more helpful `__repr__()` for the `_ReferenceBuffer` class. Additionally, it moves the `MemoryAwareClosureGeneration` and `MemoryAwarePrint*` passes from the `CommonExtensions` to the `MemoryLevelExtension`.

## Added
- Add many missing docstrings
- Add `__repr__()` function for the `_ReferenceBuffer` class

## Changed
- Move `MemoryAwareClosureGeneration` pass to `MemoryLevelExtension`
- Move `MemoryAwarePrint*` passes to `MemoryLevelExtension`
- Make `sizeInBytes` a class property instead of a function
- Move `AnnotateNeurekaWeightMemoryLevel` to `Neureka` specific folder
This PR fixes the currently broken CI, which had two causes:
1. `setuptools 82.0.0` recently removed the `pkg_resources` library, which is used by GVSoC
2. The Docker flow did not check out git lfs files; however, these are required to use the precompiled `udma_v4_gap9_v2_impl.so`

**The GAP9 check will only pass on the fork! (See below for status)**

## Changed
- Use the `devel` container by default for GAP9 CI
- Extend Readme platforms with GAP9 shields

## Fixed
- Fix Docker flow to fetch `*.so` git lfs files
- Downgrade `setuptools` to `81.0.0`
Fix broken CI cache generation by adding missing `shell: bash` directive and correcting a test case reference.

See below for successful cache generation actions:
- GAP9: https://github.com/pulp-platform/Deeploy/actions/runs/23290279441 
- Others: https://github.com/pulp-platform/Deeploy/actions/runs/23290526544

## Changed
- Added `shell: bash` to the "Generate CCache" step in `infra-generate-ccache.yml` to ensure correct shell execution
- Added `shell: bash` to the "Generate CCache for GAP9" step in `infra-generate-ccache-gap9.yml` to ensure correct shell execution

## Fixed
- Fixed wrong test case in GAP9 ccache workflow: replaced `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/Add/Large-5000-L2-singlebuffer]` with `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/MatMul/Regular-64000-L2-singlebuffer]`
…form#177)

* [Deeploy PR] put the tiling information into layer code as well

* [Deeploy PR] Fix the tiling information corruption. Add bracket before and after L3 code for each layer to reduce stack usage

- Previously the tiling information was corrupted after each run, because the generated code wrote the next element of the tiling-information array into the current location. After one RunNetwork call, the last element of the tiling array ended up in the first location, corrupting the tiling information. The fix in tilingVariableReplacement points the reference to the new location instead of assigning a value through the reference.

- The C code generated by Deeploy caused high stack usage, because all variables defined in RunNetwork live on the stack, including all the tiling pointers and call arguments. Wrapping each layer in RunNetwork in its own brace scope makes the call arguments live for only one layer, significantly reducing stack usage. The tiling pointers still live on the stack and need to be moved as well, but that requires more changes.

* Update CHANGELOG.md
…tform#162)

* Deeploy Microbenchmark with GVSoC CSR and Demo on GEMM

* Add microbenchmark to codepass

* Update pro microbenchmark code transformation

* Add helper function for profileMicrobenchmark

* perf-util add pre-commit

* Rebase singlebuffertilingcodegeneration

* Make workspace safe to prevent "dubious ownership" sporadic issues

* Update changelog

* Fix linting

* Add microbenchmark tutorial to docs

* Trim microbenchmark tutorial

---------

Co-authored-by: Run Wang <52746141+SamanthaWangdl@users.noreply.github.com>
Co-authored-by: Victor Jung <jungvi@iis.ee.ethz.ch>
* Add option to deploy on the board for the GAP9 platform

* Add proper D flag for GAP9 board

* Make pane name agnostic of the config

* Fix usbip host resolve for Linux platforms

* Fix hostname resolution for Macos

* Live print of the simulator cmd

* Revert gap9 docker link

* Add optional GPIO toggling for power measurements for GAP9

* Format

* Cleanup file handles to avoid unhandled exception in pytest

* format

* Remove unused GPIO and update gitignore

* Align gap9-run.sh mount point with README convention

Mount the host working directory to /app/Deeploy inside the container
(matching README.md / README_GAP9.md) instead of /app/work.

* Document 'board' as a valid -s simulator choice

* Clarify -h text for board simulator and powerMeasurement

* README_GAP9: document -s board and --powerMeasurement

---------

Co-authored-by: Run Wang <samanthawangdl@gmail.com>
@runwangdl force-pushed the gap9-ne16 branch 12 times, most recently from 4edb011 to 748707a on April 14, 2026 at 08:54
runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
@runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 on April 14, 2026 at 10:50
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)