These principles are non-negotiable. Every contribution must align with them.
PyGPUkit makes GPU programming feel like using a standard Python library: pip-installable, minimal setup, no mandatory CUDA Toolkit.
- Explicit over implicit - GPU operations are visible, not hidden
- Performance is a prerequisite - Anything slower than cuBLAS requires justification
- NumPy-like semantics - `C = A @ B`, not opaque operator graphs
- GPU as a schedulable resource - Kubernetes-inspired admission control
Python - High-level orchestration ONLY
Rust - Memory pool, scheduler, GPU coordination
C++ - CUDA Driver/Runtime API, NVRTC, kernel launch
Python must remain a thin wrapper. Performance-critical logic belongs in Rust or C++.
| Type | Examples |
|---|---|
| Performance improvements | Faster kernels, better memory patterns |
| New GPU operations | Ops that fit the GPUArray model |
| Bug fixes | Correctness issues, memory leaks |
| SM architecture support | New GPU generations (with benchmarks) |
| Documentation | Clarifications, examples, typo fixes |
| Type | Reason |
|---|---|
| Python CUDA wrappers | No cuda-python, numba.cuda, cupy.cuda |
| Training features | Autograd, optimizers, training loops |
| Legacy GPU support | SM < 80 (Turing and below) |
| Magic/implicit behavior | Hidden allocations, undocumented heuristics |
| Over-engineering | Features for hypothetical future needs |
- New module additions (e.g., vision, TTS)
- Alternative backends (ROCm, Metal)
- Breaking API changes
These rules cannot be violated. PRs that break them will be rejected.
Python API --> pybind11 --> C++ --> CUDA Driver/Runtime/NVRTC
|
+--> PyO3 --> Rust (memory, scheduler)
These MUST NOT be removed or reimplemented in Python:
- Memory pool with LRU eviction (`rust/pygpukit-core/src/memory/`)
- GPU scheduler state machine (`rust/pygpukit-core/src/scheduler/`)
- Async GPU memory transfer engine
- Kernel dispatch controller
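
For readers unfamiliar with the eviction policy named above, here is a minimal sketch of LRU eviction under a byte budget. It is written in Python purely as an illustration: the class name `LruPoolSketch` and its methods are hypothetical and do not reflect the Rust implementation, which must stay in Rust.

```python
from collections import OrderedDict


class LruPoolSketch:
    """Toy model of a byte-budgeted pool with least-recently-used eviction.

    Hypothetical illustration only; not the PyGPUkit memory pool.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Maps block id -> size in bytes; dict order doubles as recency order
        # (oldest first, most recently used last).
        self._blocks = OrderedDict()

    def touch(self, block_id):
        """Mark an existing block as most recently used."""
        self._blocks.move_to_end(block_id)

    def allocate(self, block_id, size):
        """Add a block, evicting the oldest blocks until it fits.

        Returns the ids that were evicted, oldest first.
        """
        if size > self.capacity_bytes:
            raise MemoryError(f"{size} bytes exceeds pool capacity")
        if block_id in self._blocks:
            raise ValueError(f"block {block_id!r} is already allocated")
        evicted = []
        while self.used_bytes + size > self.capacity_bytes:
            old_id, old_size = self._blocks.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self._blocks[block_id] = size
        self.used_bytes += size
        return evicted
```

With a 100-byte budget, allocating two 40-byte blocks `a` and `b`, touching `a`, and then allocating a third 40-byte block evicts `b`, because `a` was used more recently.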
| Module | Modality | Input | Output |
|---|---|---|---|
| `ops/` | Tensors | GPUArray | GPUArray |
| `llm/` | Text | Tokens | Tokens |
| `asr/` | Audio | Waveform | Text |
Modules are separated by modality, not architecture.
| Path | Language | Owner |
|---|---|---|
| `src/pygpukit/` | Python | API surface only |
| `native/ops/` | C++/CUDA | Kernel implementations |
| `native/core/` | C++ | CUDA utilities |
| `rust/pygpukit-core/` | Rust | Runtime core |
| Metric | Requirement |
|---|---|
| Regression | Not allowed without explicit justification |
| New kernels | Must include benchmark results |
| TensorCore | Required for FP16/BF16/TF32 on SM >= 80 |
| Memory | No hidden allocations in hot paths |
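
When reporting benchmark results for a matrix-multiply kernel, throughput is usually derived from the standard count of 2·M·N·K floating-point operations. The sketch below shows that arithmetic and a simple regression check. The function names and the 2% noise tolerance are assumptions made for illustration; they are not taken from `scripts/benchmark.py`.

```python
def matmul_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) matmul in TFLOPS.

    Uses the conventional 2*m*n*k operation count (one multiply and
    one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def is_regression(baseline_tflops, candidate_tflops, tolerance=0.02):
    """Return True if the candidate is slower than baseline beyond noise.

    The 2% default tolerance is an assumed value, not a project rule.
    """
    return candidate_tflops < baseline_tflops * (1.0 - tolerance)
```

For example, a 4096x4096x4096 multiply that takes 10 ms works out to roughly 13.74 TFLOPS.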
- Supported: SM 80+ (Ampere, Ada, Hopper, Blackwell)
- Build default: SM 80, 86, 89, 90, 100, 120a
- Unsupported: SM < 80
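
The support policy above can be expressed as a small check. This is a sketch only: the helper names are hypothetical, and the build script may validate targets differently.

```python
MIN_SUPPORTED_SM = 80
DEFAULT_BUILD_TARGETS = ("80", "86", "89", "90", "100", "120a")


def sm_number(target):
    """Parse an SM target such as '86' or '120a' into its numeric part."""
    # Strip an architecture-specific suffix like the 'a' in '120a'.
    digits = target.rstrip("abcdefghijklmnopqrstuvwxyz")
    return int(digits)


def is_supported(target):
    """Return True if the target meets the SM 80+ requirement."""
    return sm_number(target) >= MIN_SUPPORTED_SM
```

Under this check `is_supported("75")` is false (Turing), while every default build target passes.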
// DO: L2-friendly, coalesced, vectorized
float4 data = *reinterpret_cast<float4*>(&input[idx]);
// DON'T: Complex shared-memory tiling for Pascal/Turing
__shared__ float tile[32][32];  // Often slower on Ampere

- No `cuda-python` or external Python CUDA dependencies
- No secrets in code (API keys, tokens, passwords)
- No force push to main/master
- No skipping pre-commit hooks
- Check existing issues - Your idea may already be discussed
- Read CLAUDE.md - Understand architecture and constraints
- Small changes: Just open a PR
- Large changes: Open an issue first to discuss approach
# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/PyGPUkit.git
cd PyGPUkit
# 2. Create feature branch
git checkout -b feature/your-feature
# 3. Build (Git Bash)
./build.sh 86 # or 120a for RTX 5090
# 4. Make changes, then run checks
git ls-files "*.py" | xargs python -m ruff check --fix
git ls-files "*.py" | xargs python -m ruff format
python -m mypy src/ --ignore-missing-imports \
--disable-error-code=union-attr \
--disable-error-code=no-redef \
--disable-error-code=no-any-return \
--disable-error-code=attr-defined \
--disable-error-code=assignment \
--disable-error-code=arg-type \
--disable-error-code=index \
--disable-error-code=misc
# 5. Run tests
python -m pytest tests/ -v
# 6. For kernel changes, run benchmarks
python scripts/benchmark.py --quick
# 7. Commit
git commit -m "feat(scope): description"
# 8. Push and create PR
git push origin feature/your-feature

type(scope): short description
Longer description if needed.
For kernel changes:
Benchmark results (RTX 3090 Ti):
- 2048x2048: XX.XX TFLOPS
- 4096x4096: XX.XX TFLOPS
- 8192x8192: XX.XX TFLOPS
Correctness: PASS
Types: feat, fix, perf, refactor, docs, test, build, wip, bench
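
A commit subject can be checked against this format before pushing. The sketch below is one possible check, not an existing project hook; it assumes the scope is mandatory and that a single space follows the colon, as in the template.

```python
import re

COMMIT_TYPES = (
    "feat", "fix", "perf", "refactor", "docs", "test", "build", "wip", "bench",
)

# type(scope): short description
# Assumption: scope is required and contains no spaces or parentheses.
SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)\((?P<scope>[^()\s]+)\): (?P<description>\S.*)$"
)


def check_subject(subject):
    """Return True if the subject line matches type(scope): description."""
    match = SUBJECT_RE.match(subject)
    return match is not None and match.group("type") in COMMIT_TYPES
```

For instance, `feat(ops): add fused softmax` passes, while `chore(ops): tidy` fails because `chore` is not in the type list.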
- All CI checks pass (lint, typecheck, tests)
- No performance regressions (for kernel changes)
- Benchmark results included (for kernel changes)
- Documentation updated if needed
- No breaking changes without discussion
PRs are evaluated on these criteria:
| Criterion | Check |
|---|---|
| CI green | Lint, typecheck, tests pass |
| Architecture | Follows layer model and module boundaries |
| No regressions | Performance equal or better |
| Correctness | Tests pass, no silent failures |
| Criterion | Weight | Notes |
|---|---|---|
| Performance | High | Benchmark numbers required for kernels |
| Code quality | Medium | Clear, minimal, no over-engineering |
| Documentation | Medium | Updated if behavior changes |
| Test coverage | Medium | New features need tests |
- Violates architectural invariants
- Introduces cuda-python or similar dependencies
- Performance regression without justification
- Skips pre-commit checks
- Targets SM < 80
- Automated checks - CI must pass
- Maintainer review - Architecture and code quality
- Benchmark verification - For kernel changes
- Merge - Squash or rebase, clean history
- Open an issue for discussion
- Check CLAUDE.md for detailed architecture docs
- Review existing PRs for examples