Add integer MaxPool1D for the Generic platform and RQSConv1D support for PULPOpen, with corresponding kernel tests.

## Added
- MaxPool1D fp32 kernel and template for the Generic platform
- Tests for MaxPool1D fp32 and int8
- RQSConv1D template and tiling constraints for the PULP platform
- Test for RQSConv1D int8

## Changed
- Renamed the MaxPool2D test from MaxPool to MaxPool/Regular_2D for both Integer and FP32

## Fixed
- im2col buffer size in the Conv1D template
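The im2col fix concerns buffer sizing. As a hedged sketch (formula and helper name assumed for illustration, not Deeploy's actual code), a 1-D im2col scratch buffer must hold one output position's receptive field across all input channels:

```python
# Hypothetical helper: size of the im2col scratch buffer for a 1-D conv.
# One output position gathers kernel_size samples from each of the
# in_channels input channels, each elem_bytes wide (4 for fp32).
def im2col_buffer_bytes(kernel_size: int, in_channels: int,
                        elem_bytes: int = 4) -> int:
    return kernel_size * in_channels * elem_bytes

# e.g. a fp32 Conv1D with K=3 over 8 channels needs a 96-byte scratch buffer
assert im2col_buffer_bytes(3, 8, 4) == 96
```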
This is a very small PR adding @runwangdl as a code owner and mentioning people in `Changelog.md` who have more than one merged PR. The reasoning is that we already track contributions through `git`, and we only mention authors with significant contributions to the project.

## Changed
- Extended Codeowners
- Mentioned people with significant contributions
This PR adds comprehensive GAP9 container support with ARM64 compatibility. It uses the latest GAP SDK version (`v5.21.2`).

## Added
- GAP9 container support with ARM64 architecture support
- GAP9 container with the GAP9 SDK (`v5.21.1-staging-1`)
- GAP9 Docker GitHub build flow (`.github/workflows/docker-build-deeploy-gap9.yml`)
- GAP9 run script for real hardware (`scripts/gap9-run.sh`)
- Shell-format pre-commit hook
- zsh and oh-my-zsh plugin installation in containers
- New GAP9 README documentation (`README_GAP9.md`)
- GAP9-specific Docker patches for AMD64 and ARM64

## Changed
- Cleaned up the Docker flow to use a temporary build folder
- Memory usage is now printed by default on GAP9
- Temporarily disabled GAP9 CI on forks

## Fixed
- Spelling mistakes in documentation
- Missing version link
This PR adds many missing docstring comments and improves debugging, especially when using a GUI debugger, by providing a more helpful `__repr__()` for the `_ReferenceBuffer` class. Additionally, it moves the `MemoryAwareClosureGeneration` and `MemoryAwarePrint*` passes from `CommonExtensions` to `MemoryLevelExtension`.

## Added
- Many missing docstrings
- `__repr__()` method for the `_ReferenceBuffer` class

## Changed
- Moved the `MemoryAwareClosureGeneration` pass to `MemoryLevelExtension`
- Moved the `MemoryAwarePrint*` passes to `MemoryLevelExtension`
- Made `sizeInBytes` a class property instead of a function
- Moved `AnnotateNeurekaWeightMemoryLevel` to the `Neureka`-specific folder
This PR fixes the currently broken CI, which had two causes:
1. `setuptools 82.0.0` recently removed the `pkg_resources` library, which is used by GVSoC.
2. The Docker flow did not check out git LFS files, which are required to use the precompiled `udma_v4_gap9_v2_impl.so`.

**The GAP9 check will only pass on the fork! (See below for status)**

## Changed
- Use the `devel` container by default for GAP9 CI
- Extend the README platforms with GAP9 shields

## Fixed
- Docker flow now fetches `*.so` git LFS files
- Downgraded `setuptools` to `81.0.0`
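The `pkg_resources` breakage can also be guarded against explicitly. A hedged sketch (helper name invented, not part of Deeploy or GVSoC) that fails fast with an actionable message instead of a deep `ImportError` inside GVSoC:

```python
import importlib.util

# Hypothetical helper: return a pin instruction when pkg_resources is gone,
# which is the situation created by setuptools >= 82.0.0.
def check_pkg_resources(find_spec=importlib.util.find_spec):
    if find_spec("pkg_resources") is None:
        return ("pkg_resources missing: pin 'setuptools==81.0.0' "
                "(the module was removed in setuptools 82.0.0)")
    return None  # module available, nothing to do
```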
Fix broken CI cache generation by adding a missing `shell: bash` directive and correcting a test case reference. See below for successful cache generation runs:
- GAP9: https://github.com/pulp-platform/Deeploy/actions/runs/23290279441
- Others: https://github.com/pulp-platform/Deeploy/actions/runs/23290526544

## Changed
- Added `shell: bash` to the "Generate CCache" step in `infra-generate-ccache.yml` to ensure correct shell execution
- Added `shell: bash` to the "Generate CCache for GAP9" step in `infra-generate-ccache-gap9.yml` to ensure correct shell execution

## Fixed
- Wrong test case in the GAP9 ccache workflow: replaced `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/Add/Large-5000-L2-singlebuffer]` with `test_gap9_tiled_kernels_l2_singlebuffer[Kernels/Integer/MatMul/Regular-64000-L2-singlebuffer]`
…form#177)

* [Deeploy PR] Put the tiling information into the layer code as well
* [Deeploy PR] Fix the tiling-information corruption. Add braces before and after the L3 code for each layer to reduce stack usage
  - Previously, the tiling information was corrupted after each run because the generated code wrote the next element of the tiling-information array into the current location. After one RunNetwork run, the last element of the tiling array ended up in the first location, corrupting the tiling information. The fix in the tiling variable replacement resolves this by pointing the reference to the new location instead of assigning a value through the reference.
  - The C code generated by Deeploy caused large stack usage because all variables defined in RunNetwork live on the stack, including all tiling pointers and call arguments. Adding braces before and after each layer in RunNetwork makes the call arguments live for only one layer, significantly reducing stack usage. The tiling pointers still live on the stack and should be moved as well, but this requires more changes.
* Update CHANGELOG.md
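The two fixes can be sketched at the code-generation level. This is an illustrative Python sketch (function and variable names invented, not Deeploy's actual Mako templates); the C fragments in the strings show the pointer-rebinding fix and the brace scoping:

```python
# Emit each layer's C code inside its own brace-delimited block scope, so
# per-layer call-argument temporaries die at the closing brace and no longer
# accumulate on RunNetwork's stack frame.
def emit_layer(name: str, body: str) -> str:
    return f"{{ // {name}\n{body}\n}} // end {name}\n"

# The corruption bug, expressed as the C statement a template could emit:
buggy = "*tilingState = tilingState[1];"   # writes the NEXT element into the
                                           # CURRENT slot, shifting the array
                                           # by one after every RunNetwork pass
fixed = "tilingState = &tilingState[1];"   # rebinds the pointer instead, so
                                           # the array contents stay intact

run_network = "".join(emit_layer(n, fixed) for n in ("Conv0", "Relu0"))
```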
…tform#162)

* Deeploy Microbenchmark with GVSoC CSR and Demo on GEMM
* Add microbenchmark to codepass
* Update pro microbenchmark codetransformation
* Add helper function for profileMicrobenchmark
* perf-util: add pre-commit
* Rebase singlebuffertilingcodegeneration
* Make workspace safe to prevent "dubious ownership" sporadic issues
* Update changelog
* Fix linting
* Add microbenchmark tutorial to docs
* Trim microbenchmark tutorial

---------

Co-authored-by: Run Wang <52746141+SamanthaWangdl@users.noreply.github.com>
Co-authored-by: Victor Jung <jungvi@iis.ee.ethz.ch>
* Add option to deploy on the board for the GAP9 platform
* Add proper D flag for the GAP9 board
* Make pane name agnostic of the config
* Fix usbip host resolution for Linux platforms
* Fix hostname resolution for macOS
* Live print of the simulator cmd
* Revert gap9 docker link
* Add optional GPIO toggling for power measurements on GAP9
* Format
* Clean up file handles to avoid unhandled exceptions in pytest
* Format
* Remove unused GPIO and update gitignore
* Align gap9-run.sh mount point with README convention: mount the host working directory to /app/Deeploy inside the container (matching README.md / README_GAP9.md) instead of /app/work
* Document 'board' as a valid -s simulator choice
* Clarify -h text for board simulator and powerMeasurement
* README_GAP9: document -s board and --powerMeasurement

---------

Co-authored-by: Run Wang <samanthawangdl@gmail.com>
Mirrors the `Siracusa_w_neureka` pattern. `NE16Platform` extends `GAP9Platform` with `engines=[NE16Engine, GAP9ClusterEngine]`; `NE16Deployer` extends `GAP9Deployer` (reuses ClDma transformers via `GAP9Bindings`).

New Target: `Deeploy/Targets/NE16/` (Platform, Engine, Bindings, Parsers, Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses). The `_weightEncode` function is ported from `pulp-nnx/test/Ne16Weight.py` (single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile constants are set per `ne16_task_defs.h` (output 3x3, weight stride bytes PW=16, DW/Dense=144).

New test infrastructure:
- `DeeployTest/deeployRunner_tiled_gap9_w_ne16.py`
- `DeeployTest/test_gap9_ne16_tiled_config.py` (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- `testUtils/platformMapping.py`: register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, mapDeployer.
- `testMVP.py`: include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it, NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks to exhaustion).
- `testUtils/core/execution.py`: build the GAP9 SDK 'image' target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before the gvsoc run).
- `CMakeLists.txt`, `DeeployTest/CMakeLists.txt`: accept GAP9_w_NE16 alongside GAP9 in the platform branches.
- `TargetLibraries/GAP9/CMakeLists.txt`: for the GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Fix: `Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` referenced an undefined symbol `float32_tPtr` from `Deeploy.AbstractDataTypes`; define it locally via `PointerClass(float32_t)` to unblock the import chain reached by NE16Platform.

Verified on gvsoc gap9.evk:
- PW 1x1 RQ (Regular_RQ): 0/1152 errors, 901917 cycles
- DW 3x3 RQ (DW_2D_RQ): 0/1280 errors, 27339 cycles (--enable-3x3)
- Dense 3x3 (Regular_2D_RQ): 0/6372 errors, 244595 cycles (--enable-3x3)
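As a shape-level illustration of the single CIN_SUBTILE=16 mode (sizes are illustrative; this is not the ported encoder, which additionally bit-serializes each subtile before emitting bytes):

```python
import numpy as np

CIN_SUBTILE = 16

# A Ne16Weight.py-style encoder consumes input channels in subtiles of 16,
# zero-padding cin up to the next multiple of 16 before packing.
def split_subtiles(w: np.ndarray) -> np.ndarray:
    cout, cin, H, W = w.shape
    n_sub = -(-cin // CIN_SUBTILE)          # ceil(cin / 16)
    pad = n_sub * CIN_SUBTILE - cin         # zero channels appended at the end
    w = np.pad(w, ((0, 0), (0, pad), (0, 0), (0, 0)))
    return w.reshape(cout, n_sub, CIN_SUBTILE, H, W)

# cin = 20 splits into 2 subtiles: one full, one with 12 padded channels.
w_sub = split_subtiles(np.ones((4, 20, 3, 3), dtype=np.uint8))
assert w_sub.shape == (4, 2, 16, 3, 3)
```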
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input at C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix the L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
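The DequantQuantMergePass idea can be shown numerically. A scale-only sketch (real Quant/Dequant nodes also carry zero points and clipping, omitted here; function names are illustrative):

```python
# Hedged arithmetic sketch of folding an adjacent Dequant -> Quant pair.
def dequant(q: int, s_d: float) -> float:
    return q * s_d                  # int -> float at scale s_d

def quant(x: float, s_q: float) -> int:
    return round(x / s_q)           # float -> int at scale s_q

q = 42
# Equal scales: the pair is an identity on the integer value...
assert quant(dequant(q, 0.125), 0.125) == q
# ...otherwise it collapses to a single rescale by s_d / s_q, which is what
# a RequantShift-style multiply-and-shift node implements in fixed point.
s_d, s_q = 0.5, 0.125
assert quant(dequant(q, s_d), s_q) == round(q * (s_d / s_q))
```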
- `TargetLibraries/GAP9/CMakeLists.txt`: rename CNN_Libraries_NE16 → CNN_Libraries_HWPE (the actual gap9-sdk path); skip the SDK CNN_BasicKernels_NE16.c source for the GAP9_w_NE16 platform (it uses the pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- `Deeploy/Targets/NE16/Platform.py`: instantiate the GAP9ClusterEngine with a trimmed includeList (no CNN_BasicKernels_NE16.h / ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h; the NE16_REG_* macros are defined in both and trigger -Werror redefinitions.
`ghcr.io/pulp-platform/deeploy-gap9:*` is hosted in pulp-platform's private GitHub Container Registry. Only upstream's self-hosted runners have credentials to pull it; on fork CI runs (ubuntu-latest), the docker pull fails with "Error response from daemon: denied" and the whole job is reported as a failure. Guard the select-env entry of all three GAP9 workflows (ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they skip cleanly on forks instead of failing. Upstream behaviour is unchanged.
`QuantChecker.checkOutputType` (added by the NE16-Linear PR) requires `opSigned == outputTypeSigned`. Existing Generic and PULPOpen bindings only registered the signed-int8 output variant, so any Quant pattern with `signed=0` (e.g., 4-bit unsigned quantization in Models/Transformer_DeepQuant) had no candidate and parsing exhausted backtracking. Add uint8 output to BasicQuantBindings and uint8 input to BasicDequantBindings in both `Targets/Generic/Bindings.py` and `Targets/PULPOpen/Bindings.py`. Verified: Models/Transformer_DeepQuant network generation now succeeds for both the Generic and Siracusa platforms.
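Numerically, the signedness constraint comes down to the representable output range. A small sketch (helper name invented) of why a `signed=0` Quant node needs an unsigned output binding:

```python
# n-bit quantizer output range: unsigned clamps to [0, 2^n - 1], signed to
# [-2^(n-1), 2^(n-1) - 1] -- so signedness must agree between the op and the
# bound output type, which is the condition checkOutputType enforces.
def quant_range(n_bits: int, signed: bool) -> tuple:
    if signed:
        return (-2**(n_bits - 1), 2**(n_bits - 1) - 1)
    return (0, 2**n_bits - 1)

assert quant_range(4, signed=False) == (0, 15)     # 4-bit unsigned case
assert quant_range(8, signed=True) == (-128, 127)  # int8 output binding
```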
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner (`std::bad_alloc` from the C compiler driver) when 4 pytest-xdist workers compile in parallel. Two workers leave enough headroom on the standard 7-GB runner.

(Pre-existing flake; it surfaced as a hard fail in CI runs that happen to land both heavy FP32 GEMM compilations on adjacent workers.)
Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform `GAP9_w_NE16` that mirrors the `Siracusa_w_neureka` pattern.

## Added
- `Deeploy/Targets/NE16/` — full Target: Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses. `NE16Platform` extends `GAP9Platform` with `engines=[NE16Engine, GAP9ClusterEngine]`; `NE16Deployer` extends `GAP9Deployer`. `_weightEncode` ported from `pulp-nnx/test/Ne16Weight.py` (single CIN_SUBTILE=16 mode).
- `DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` + `DeeployTest/test_gap9_ne16_tiled_config.py` — runner + kernel test config.
- `DeeployTest/test_platforms.py` — pytest functions `test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer` under marker `gap9_w_ne16_tiled`.
- `.github/workflows/{ci-platform-gap9-w-ne16-tiled.yml,_runner-gap9-w-ne16-tiled.yml}` — CI jobs (single + double buffer L2).
- `TargetLibraries/GAP9/CMakeLists.txt` — `add_subdirectory(pulp-nnx)` with `USE_NE16=ON` for `GAP9_w_NE16`.

## Changed
- `DeeployTest/testUtils/platformMapping.py` — register `GAP9_w_NE16` in names/`mapPlatform`/`setupMemoryPlatform`/`mapDeployer`.
- `DeeployTest/testMVP.py` — wrap the deployer with `EngineColoringDeployerWrapper` for `GAP9_w_NE16` (without it, NE16 nodes never get an engine color and parsing fails).
- `DeeployTest/testUtils/core/execution.py` — append the GAP9 SDK `image` build target for `GAP9_w_NE16` (so `chip.soc.mram.bin` is produced before `gvsoc run`).
- `CMakeLists.txt`, `DeeployTest/CMakeLists.txt` — accept `GAP9_w_NE16` alongside `GAP9` in the platform branches.
- `Deeploy/Targets/NE16/Templates/ConvTemplate.py` — NE16 subtile constants per `ne16_task_defs.h`: `CIN_SUBTILE` 16, output 3, weight stride `d0 = 3*3*weight_d0_stride_mode8 = 18` for DW/Dense (PW: `qw * weight_d0_stride = 16`). Emit top-level `ne16_task_t` fields (`weight_d0_stride`, `qw`, `subtile_output_channel`, `kernel_shape`, `depthwise`) that the HW reads at dispatch time.
- `Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` — DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so `_weightEncode` sees the standard `(cout, 1, H, W)` layout and produces the correct `(1, 1, packed_bytes)` single-block output expected by the NE16 HW.
- `Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` — DW weight is a single packed block (not per-cout); constrain `weightOutChannelVar == Max` and reuse the same `HyperRectangle((0,0,0), weightShape)` for every output-channel tile.
- `Deeploy/Targets/NE16/Parsers.py` — drop the `group == shape[1]` check in `NE16DWConv2DParser` (invalid under the post-encode rank-3 layout).

## Fixed
- `Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` — work around a pre-existing `ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes'` by defining it locally via `PointerClass(float32_t)`.

## Test plan
Run on gvsoc `gap9.evk` inside `ghcr.io/pulp-platform/deeploy-gap9:devel`. All verified dispatches (`ne16_nnx_dispatch` appears in the generated Network.c for NE16-routed nodes):
- `Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ`
- `Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ`
- `Kernels/Integer/Conv/PW_2D`
- `Kernels/Integer/Conv/DW_2D_RQ`
- `Kernels/Integer/Conv/DW_2D_RQ`
- `Kernels/Integer/Conv/StriddedPadded_2D_RQ`
- `Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ`
- `Kernels/Integer/Conv/DW_2D_RQ`

Follow-up (out of scope):
- `PW_2D_RQ/Unsigned_RQ` uses int8 input. `Ne16TestConf.py` only supports uint8 and the NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign-propagation (shift int8 → uint8 + adjust `weight_offset`).
- `Tests/Kernels/Integer/Conv/` today (`Regular_2D_RQ` is 8×8); coverage is via the model path once the remaining tiling-system edge cases are resolved.

## PR Merge Checklist
- `devel` commit and pointing to `devel`.
- `CHANGELOG.md` file has been updated.