A pure-Go deep learning and machine learning framework with PyTorch-style autograd, neural-network layers, optimizers, classical ML algorithms, and an optional CUDA backend.
GoNN is a single-binary, dependency-light alternative to PyTorch / tinygrad / TensorFlow, written in Go. It provides:
- Autograd Tensor: flat-buffer `*tensor.Tensor` with shape + strides + automatic differentiation by reverse-mode graph traversal.
- `nn` package:
  - Linear & conv: `Linear`, `Conv1d`, `Conv2d`, `Conv3d`, `ConvTranspose1d/2d/3d`, `Embedding`.
  - Pooling: `MaxPool2d`, `AvgPool2d`, `AdaptiveMaxPool1d/2d/3d`, `AdaptiveAvgPool1d/2d/3d`.
  - Normalization: `BatchNorm1d`, `BatchNorm2d`, `LayerNorm`, `GroupNorm`, `RMSNorm`, `InstanceNorm1d/2d`.
  - Padding & upsample: `ZeroPad2d`, `ConstantPad2d`, `ReflectionPad2d`, `ReplicationPad2d`, `Upsample`, `PixelShuffle`/`PixelUnshuffle`.
  - Recurrent: `RNN`, `LSTM`, `GRU` (single layer), `RNNCell`, `LSTMCell`, `GRUCell`, `MultiLayerRNN`, `MultiLayerLSTM`, `MultiLayerGRU` (multi-layer + bidirectional), `Seq2Seq`.
  - Attention/Transformer: `MultiHeadAttention` (optional causal mask), `TransformerEncoderLayer`/`TransformerEncoder`, `TransformerDecoderLayer`/`TransformerDecoder`.
  - Containers: `Sequential`, `Dropout`.
  - Parametric/gated activations as modules: `PReLU` (learnable slope), `GLU`.
- `optim` package: SGD (momentum/Nesterov), Adam, AdamW, RMSprop, Adagrad, Adadelta, NAdam, Adamax, RAdam, LBFGS (closure-style), Rprop, plus LR schedulers (StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, LinearLR, PolynomialLR, ChainedScheduler, SequentialLR, CyclicLR, ReduceLROnPlateau, OneCycleLR).
- `ml` package, classical algorithms:
  - Linear: LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, LogisticRegression.
  - Discriminant: LinearDiscriminantAnalysis (LDA, with Fisher transform), QuadraticDiscriminantAnalysis (QDA).
  - Trees & ensembles: DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor, ExtraTreesClassifier/Regressor, AdaBoostClassifier, GradientBoostingClassifier/Regressor, IsolationForest.
  - Neighbors: KNNClassifier, KNNRegressor.
  - SVM: LinearSVC.
  - Naive Bayes: GaussianNB, MultinomialNB, BernoulliNB.
  - Clustering: KMeans, DBSCAN, AgglomerativeClustering, MeanShift, GaussianMixture.
  - Dim. reduction: PCA, KernelPCA, FastICA, TSNE.
  - Preprocessing: StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder, PolynomialFeatures.
  - Metrics + model selection: Accuracy, Precision/Recall/F1, ConfusionMatrix, MSE/MAE/R²/SilhouetteScore/ROCAUC, TrainTestSplit, KFold, CrossValScore.
- `data` package: `Dataset`, `DataLoader`, transforms, MNIST/CSV loaders, synthetic dataset generators (`MakeRegression`, `MakeClassification`, `MakeBlobs`, `MakeMoons`).
- `backend` package: a pluggable compute backend with three implementations, namely CPU (gonum BLAS), CUDA (`-tags cuda`, CGO + cuBLAS, wired into `tensor.MatMul` so real models run GEMMs on the GPU; verified on an RTX 3060 and benchmarked vs PyTorch/TensorFlow/tinygrad, see Benchmarks), and OpenCL (`-tags opencl`, CGO + fp64 kernels; numerically verified against the CPU backend). One `backend.Backend` interface; callers don't change.
- Tensor dtypes: `Float64` (default), `Float32`, `Float16` with correct IEEE-754 precision/range semantics (`x.To(tensor.Float16)`, numpy-style type promotion). Emulated on float64 storage for correct mixed-precision numerics; for real fp16 storage + tensor-core compute, use the GPU `DeviceBufferF16` path (fp16 GEMM at ~24 TFLOP/s on a 3060).
- Fused CUDA flash-attention (forward + backward): a custom fp64 flash-attention kernel (online softmax, no S×S materialization) wired into `nn.MultiHeadAttention`. On causal fp64 attention it is ~1.4–1.5× faster than PyTorch's SDPA, and its fused backward (gradcheck ≈ 5e-8) lets the differentiable `Forward` train on the kernel, not just run inference.
Everything compiles to a single static Go binary (no Python runtime).
go get github.com/Shivangx01b/GoNN

package main
import (
"fmt"
"gonn/tensor"
)
func main() {
x := tensor.New([]float64{1, 2, 3}, 3, 1).SetRequiresGrad(true)
W := tensor.New([]float64{2, -1, 0.5}, 1, 3).SetRequiresGrad(true)
y := W.MatMul(x).Square().Sum()
y.Backward()
fmt.Println("y =", y)
fmt.Println("dy/dx =", x.Grad) // [6, -3, 1.5]
fmt.Println("dy/dW =", W.Grad) // [3, 6, 9]
}

See examples/regression:
W := tensor.Randn(1, 1).SetRequiresGrad(true)
b := tensor.Zeros(1).SetRequiresGrad(true)
opt := optim.NewSGD([]*tensor.Tensor{W, b}, 0.01)
for epoch := 0; epoch < 200; epoch++ {
opt.ZeroGrad()
pred := X.MatMul(W).Add(b)
loss := pred.Sub(Y).Square().Mean()
loss.Backward()
opt.Step()
}

import (
"gonn/nn"
"gonn/optim"
"gonn/tensor"
)
model := nn.Sequential(
nn.NewLinear(784, 256, true),
nn.ReLU{},
nn.NewLinear(256, 64, true),
nn.ReLU{},
nn.NewLinear(64, 10, true),
)
opt := optim.NewAdam(model.Parameters(), 1e-3)
for epoch := 0; epoch < 10; epoch++ {
for batch := range loader.Iter() {
opt.ZeroGrad()
logits := model.Forward(batch.X)
loss := nn.CrossEntropyLoss(logits, batch.Y)
loss.Backward()
opt.Step()
}
}

import "gonn/ml"
// K-means clustering
km := ml.NewKMeans(3, 100, 1e-4)
km.Fit(X)
labels := km.Predict(X)
// Random forest classification
rf := ml.NewRandomForestClassifier(100, 10, 0)
rf.Fit(Xtr, ytr)
yhat := rf.Predict(Xte)
// Gradient boosting regression
gb := ml.NewGradientBoostingRegressor(100, 0.1, 3)
gb.Fit(Xtr, ytr)

| Category | Methods |
|---|---|
| Construct | New, Zeros, Ones, Full, Randn, Uniform, Arange, Eye, Scalar |
| Arithmetic | Add, Sub, Mul, Div, MatMul, Neg, scalar variants AddScalar, … |
| Unary | Exp, Log, Sqrt, Sin, Cos, Tan, Abs, Reciprocal, Pow, Square, Clip |
| Reduction | Sum, Mean, Max, Min, SumAxis, MeanAxis, MaxAxis, MinAxis, ArgMax, ArgMin |
| Shape | Reshape, View, Flatten, Transpose, T, Permute, Squeeze, Unsqueeze, Expand, Concat, Stack |
| Activation | ReLU, LeakyReLU, ELU, SELU, CELU, Sigmoid, Tanh, LogSigmoid, HardTanh, HardSigmoid, Softplus, Softsign, GELU, SiLU (Swish), HardSwish, Mish, ReLU6, Hardshrink, Softshrink, Tanhshrink, Threshold, RReLU, Softmax, LogSoftmax |
| Autograd | SetRequiresGrad, Backward, ZeroGrad, .Grad |
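The Autograd row above rests on reverse-mode differentiation. The following stdlib-only sketch shows the general technique on scalars: record each operation's local derivative rule, order the graph topologically, then walk it from the output back to the leaves. The `Value` type and helper names are hypothetical and are not GoNN's implementation; the inputs reuse the quick-start values so the gradients can be compared with the `[6, -3, 1.5]` and `[3, 6, 9]` shown there.

```go
package main

import "fmt"

// Value is a scalar node in a computation graph (hypothetical type).
type Value struct {
	Data, Grad float64
	parents    []*Value
	backward   func() // pushes this node's Grad into its parents
}

// Leaf creates an input node with no parents.
func Leaf(x float64) *Value { return &Value{Data: x} }

// Add records out = a + b; d(out)/da = d(out)/db = 1.
func Add(a, b *Value) *Value {
	out := &Value{Data: a.Data + b.Data, parents: []*Value{a, b}}
	out.backward = func() {
		a.Grad += out.Grad
		b.Grad += out.Grad
	}
	return out
}

// Mul records out = a * b; d(out)/da = b and d(out)/db = a.
func Mul(a, b *Value) *Value {
	out := &Value{Data: a.Data * b.Data, parents: []*Value{a, b}}
	out.backward = func() {
		a.Grad += b.Data * out.Grad
		b.Grad += a.Data * out.Grad
	}
	return out
}

// Backward orders the graph topologically, seeds d(out)/d(out) = 1,
// then applies each node's local rule from the output back to the leaves.
func Backward(out *Value) {
	var order []*Value
	seen := map[*Value]bool{}
	var visit func(v *Value)
	visit = func(v *Value) {
		if seen[v] {
			return
		}
		seen[v] = true
		for _, p := range v.parents {
			visit(p)
		}
		order = append(order, v)
	}
	visit(out)
	out.Grad = 1
	for i := len(order) - 1; i >= 0; i-- {
		if order[i].backward != nil {
			order[i].backward()
		}
	}
}

// SquaredDot builds y = (sum_i w[i]*x[i])^2 and runs Backward on it.
func SquaredDot(w, x []*Value) *Value {
	s := Leaf(0)
	for i := range w {
		s = Add(s, Mul(w[i], x[i]))
	}
	y := Mul(s, s)
	Backward(y)
	return y
}

func main() {
	x := []*Value{Leaf(1), Leaf(2), Leaf(3)}
	w := []*Value{Leaf(2), Leaf(-1), Leaf(0.5)}
	y := SquaredDot(w, x)
	fmt.Println(y.Data, x[0].Grad, x[1].Grad, x[2].Grad) // 2.25 6 -3 1.5
	fmt.Println(w[0].Grad, w[1].Grad, w[2].Grad)         // 3 6 9
}
```

Gradients accumulate with `+=`, which is why a training loop needs a `ZeroGrad`-style reset between steps.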
The default build is pure-Go CPU. To compile against CUDA:
# 1. Build the native library
cd backend/cuda
nvcc -O3 -Xcompiler -fPIC -shared gonn_cuda.cu -o libgonn_cuda.so -lcublas
# 2. Build GoNN with the cuda tag
go build -tags cuda ./...

The CUDA implementation lives in `backend/cuda/gonn_cuda.cu`. The Go side calls into it via CGO and uses cuBLAS for matmul. The CPU and CUDA backends share the same `backend.Backend` interface so callers do not need to change.
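As a rough picture of what a shared backend interface buys, the sketch below routes a caller-facing `MatMul` through an interface value. The method set, the `naiveCPU` type, and the `active` variable are all assumptions made for illustration; the real `backend.Backend` contract is not reproduced in this README.

```go
package main

import "fmt"

// Backend is a hypothetical, reduced stand-in for a pluggable compute
// contract. The real backend.Backend method set may differ.
type Backend interface {
	Name() string
	// MatMul returns the row-major product of a (m×k) and b (k×n).
	MatMul(a, b []float64, m, k, n int) []float64
}

// naiveCPU is a reference implementation using a plain triple loop.
type naiveCPU struct{}

func (naiveCPU) Name() string { return "cpu-naive" }

func (naiveCPU) MatMul(a, b []float64, m, k, n int) []float64 {
	out := make([]float64, m*n)
	for i := 0; i < m; i++ {
		for p := 0; p < k; p++ {
			aip := a[i*k+p]
			for j := 0; j < n; j++ {
				out[i*n+j] += aip * b[p*n+j]
			}
		}
	}
	return out
}

// active is the backend that callers dispatch through. A build-tagged file
// could reassign it in init(); that mechanism is assumed, not confirmed.
var active Backend = naiveCPU{}

// MatMul is the caller-facing entry point; it never names a concrete backend.
func MatMul(a, b []float64, m, k, n int) []float64 {
	return active.MatMul(a, b, m, k, n)
}

func main() {
	a := []float64{1, 2, 3, 4} // 2×2
	b := []float64{5, 6, 7, 8} // 2×2
	fmt.Println(active.Name(), MatMul(a, b, 2, 2, 2)) // cpu-naive [19 22 43 50]
}
```

Because callers only see `MatMul`, swapping the value behind `active` changes where the work runs without touching model code.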
The GPU backend currently accelerates:
| Category | Ops |
|---|---|
| MatMul | MatMul (cuBLAS Dgemm/Sgemm) — dispatched from tensor.MatMul |
| Elementwise | AddElem, MulElem, Sub, Div, Scale, AxpyInto (in-place out += alpha*x) |
| Reductions | Sum, Max (single-block tree reduce in shared memory) |
| Activations | ReLU, Sigmoid, Tanh, Exp, Log, GELU (tanh approx.), SiLU (Swish) |
| Attention | Fused fp64 flash-attention forward (flash_attn_f64_tiled, online softmax) |
The CUDA backend is verified on the GPU for correctness against the CPU backend
(matmul maxAbsDiff ≈ 7e-16). The tensor-op path copies host↔device on every call
(no device-buffer caching yet); the device-resident benchmark path keeps inputs
on the GPU and is the apples-to-apples comparison against other frameworks.
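The attention row in the table mentions an online softmax. The CPU sketch below shows that general technique for a single query row: it keeps a running maximum, a running normaliser, and a running weighted sum, so the score row is consumed in one streaming pass. It is an illustration in plain Go with hypothetical function names, not the `flash_attn_f64_tiled` kernel, and it assumes all scores are finite (masked `-Inf` entries would need extra handling).

```go
package main

import (
	"fmt"
	"math"
)

// attendOnline computes softmax(scores)·values for one query row in a single
// pass, tracking a running max m, normaliser l, and weighted sum acc.
// Assumes every score is finite.
func attendOnline(scores []float64, values [][]float64) []float64 {
	m := math.Inf(-1)
	l := 0.0
	acc := make([]float64, len(values[0]))
	for j, s := range scores {
		mNew := math.Max(m, s)
		scale := math.Exp(m - mNew) // rescales earlier terms to the new max
		p := math.Exp(s - mNew)
		l = l*scale + p
		for d := range acc {
			acc[d] = acc[d]*scale + p*values[j][d]
		}
		m = mNew
	}
	for d := range acc {
		acc[d] /= l
	}
	return acc
}

// attendReference is the usual two-pass version: find the max, build the
// full weight vector, then take the weighted sum.
func attendReference(scores []float64, values [][]float64) []float64 {
	m := math.Inf(-1)
	for _, s := range scores {
		m = math.Max(m, s)
	}
	w := make([]float64, len(scores))
	sum := 0.0
	for j, s := range scores {
		w[j] = math.Exp(s - m)
		sum += w[j]
	}
	out := make([]float64, len(values[0]))
	for j := range w {
		for d := range out {
			out[d] += w[j] / sum * values[j][d]
		}
	}
	return out
}

func main() {
	scores := []float64{1, 2, 3}
	values := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	fmt.Println(attendOnline(scores, values))
	fmt.Println(attendReference(scores, values))
}
```

The two functions should agree up to rounding; the streaming form is what lets a tiled kernel avoid holding an S×S score matrix.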
docker build -f benchmark/docker/Dockerfile.cuda -t gonn-cuda .
docker run --rm --gpus all -v "$PWD":/work -w /work gonn-cuda \
  bash benchmark/docker/build_and_run.sh

This compiles `gonn_cuda.cu` with nvcc, builds GoNN with `-tags cuda`, verifies
correctness on the GPU, then runs the matmul, elementwise, flash-attention, and
`MultiHeadAttention.ForwardFused` benchmarks.
The OpenCL backend (backend/opencl, fp64 kernels mirroring the CUDA ones) is
numerically verified against the CPU backend:
docker run --rm --gpus all -v "$PWD":/work -w /work gonn-cuda \
  bash benchmark/docker/opencl_run.sh

It runs on any OpenCL device. The verification uses the portable oclgrind
fp64 runtime because this machine's Docker/WSL2 GPU passthrough does not inject
NVIDIA's OpenCL driver; the same binary runs on the GPU wherever a real GPU
OpenCL ICD is present (native Linux + NVIDIA driver, or a Windows host).
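"Numerically verified against the CPU backend" usually means comparing outputs element by element under a tolerance. The sketch below shows one way such a check might look, using two summation orders as stand-ins for two backends. The function names and the `1e-12` tolerance are assumptions for illustration, not the project's verification harness.

```go
package main

import (
	"fmt"
	"math"
)

// matmulForward is the reference: each output sums products with p ascending.
func matmulForward(a, b []float64, n int) []float64 {
	out := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			s := 0.0
			for p := 0; p < n; p++ {
				s += a[i*n+p] * b[p*n+j]
			}
			out[i*n+j] = s
		}
	}
	return out
}

// matmulReverse stands in for a second backend: same math, but the sum runs
// with p descending, so rounding can differ in the last bits.
func matmulReverse(a, b []float64, n int) []float64 {
	out := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			s := 0.0
			for p := n - 1; p >= 0; p-- {
				s += a[i*n+p] * b[p*n+j]
			}
			out[i*n+j] = s
		}
	}
	return out
}

// maxAbsDiff returns the largest element-wise absolute difference.
func maxAbsDiff(x, y []float64) float64 {
	worst := 0.0
	for i := range x {
		worst = math.Max(worst, math.Abs(x[i]-y[i]))
	}
	return worst
}

func main() {
	const n = 32
	a := make([]float64, n*n)
	b := make([]float64, n*n)
	for i := range a {
		a[i] = math.Sin(float64(i))
		b[i] = math.Cos(float64(i))
	}
	diff := maxAbsDiff(matmulForward(a, b, n), matmulReverse(a, b, n))
	fmt.Printf("maxAbsDiff = %.3g, within tolerance: %v\n", diff, diff < 1e-12)
}
```

A difference near machine epsilon, rather than exactly zero, is the expected outcome when two correct fp64 implementations accumulate in different orders.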
Benchmarks were run live on one machine (12-core CPU, RTX 3060) against PyTorch 2.7.1+cu128,
TensorFlow 2.20, and tinygrad 0.13, with matched CUDA-event timing. Full methodology
and tables: benchmark/REPORT.md and
benchmark/RESULTS.md. Honest highlights:
| op (N=2048 / shape) | GoNN | PyTorch | tinygrad | verdict |
|---|---|---|---|---|
| causal attention f64 (GFLOP/s) | ~87–96 | 58–66 | — | GoNN wins ~1.4–1.5× (fused kernel) |
| matmul f32 GPU (GFLOP/s) | ~7,700–8,150 | ~7,600 | 1,104 (OpenCL) | GoNN ≈ PyTorch; ~7× tinygrad |
| matmul f64 GPU (GFLOP/s) | ~170–179 | ~174 | 176 | three-way tie (all cuBLAS) |
| fused attention in nn.MultiHeadAttention | trains (gradcheck ≈5e-8) + inference on GPU | — | — | fwd+bwd kernel |
| matmul f64 CPU (GFLOP/s) | 40 (gonum) | 166 (MKL) | — | PyTorch (MKL) |
Honest summary: GoNN's GPU matmul is on par with PyTorch (both lean on cuBLAS), it beats PyTorch ~1.5× on causal fp64 attention via a custom fused kernel, and its CPU matmul is 17× faster than the old naive loop but still behind MKL (pure Go has no SIMD intrinsics). GoNN is not "faster than PyTorch everywhere" — see the report for where it wins, ties, and loses.
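For readers who want to relate a wall-clock timing to the GFLOP/s figures above, the sketch below uses the common convention of counting 2·N³ floating-point operations for an N×N by N×N matmul. Whether the report uses exactly this convention is an assumption; the helper name is hypothetical, and the naive loop here is only a timing target, not GoNN's matmul.

```go
package main

import (
	"fmt"
	"time"
)

// matmulGFLOPs converts the time for one N×N by N×N matmul into GFLOP/s,
// counting one multiply and one add per inner-loop step (2·N³ operations).
func matmulGFLOPs(n int, seconds float64) float64 {
	flops := 2 * float64(n) * float64(n) * float64(n)
	return flops / seconds / 1e9
}

// naiveMatMul is a plain triple loop used only as something to time.
func naiveMatMul(a, b []float64, n int) []float64 {
	out := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for p := 0; p < n; p++ {
			aip := a[i*n+p]
			for j := 0; j < n; j++ {
				out[i*n+j] += aip * b[p*n+j]
			}
		}
	}
	return out
}

func main() {
	const n = 256
	a := make([]float64, n*n)
	b := make([]float64, n*n)
	for i := range a {
		a[i] = float64(i%13) * 0.5
		b[i] = float64(i%7) * 0.25
	}
	start := time.Now()
	out := naiveMatMul(a, b, n)
	elapsed := time.Since(start).Seconds()
	fmt.Printf("N=%d: %.2f GFLOP/s (checksum %.1f)\n", n, matmulGFLOPs(n, elapsed), out[0])
}
```

Under this counting, a single N=2048 product is about 17.2 GFLOP of work, so a 40 GFLOP/s rate corresponds to roughly 0.43 s per product.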
GoNN/
├── tensor/ # Core Tensor + autograd
├── nn/ # Layers, losses, init, activations as modules
├── optim/ # Optimizers + LR schedulers
├── ml/ # Classical ML algorithms
├── data/ # Datasets, DataLoader, transforms
├── backend/ # CPU (gonum BLAS) / CUDA backend contract
│ └── cuda/ # CUDA kernels + fused flash-attention (build tag `cuda`)
├── benchmark/ # Cross-framework benchmarks + Docker GPU build + report
├── examples/ # Runnable demos
└── main.go # Top-level smoke test
Loss functions in `nn`: MSELoss, MAELoss / L1Loss, SmoothL1Loss, HuberLoss, CrossEntropyLoss, NLLLoss, BCELoss, BCEWithLogitsLoss, KLDivLoss, PoissonNLLLoss, GaussianNLLLoss, MarginRankingLoss, HingeEmbeddingLoss, CosineEmbeddingLoss, TripletMarginLoss, MultiMarginLoss.
Runnable demos under examples/:
- `examples/regression`: linear regression with SGD.
- `examples/mlp`: 3-class MLP classifier with Adam (reaches 100%).
- `examples/cnn`: Conv2d + MaxPool2d + AdaptiveAvgPool2d image classifier.
- `examples/transformer`: small transformer encoder + classification head.
- `examples/ml_classical`: LinearRegression + KMeans + PCA.
The tensor + autograd core, the full NN layer catalogue (linear, conv 1/2/3-d, conv-transpose, pooling/adaptive pooling, normalization, padding, upsample, RNN/LSTM/GRU with cells and bidirectional/multi-layer variants, Seq2Seq, attention, transformer encoder/decoder), all common optimizers and schedulers, and the classical ML catalogue (linear, discriminant, trees, ensembles, boosting, isolation forest, SVM, NB, KNN, clustering, dimensionality reduction, preprocessing, metrics) are implemented and tested. The compute backend is pluggable: CPU (gonum BLAS) by default, CUDA (cuBLAS + custom kernels, incl. a fused fp64 flash-attention) via -tags cuda, verified on GPU and benchmarked against PyTorch/TensorFlow/tinygrad. Coverage of more exotic corners (sparse ops, distributed training, JIT, CTC) is intentionally not pursued.
Recent correctness fixes: Concat/Stack are now autograd-aware (previously dropped gradients), BCELoss clamps to avoid log(0), BatchNorm tracks the unbiased running variance (PyTorch parity), and the CUDA backend now compiles (a constant -1.0/0.0 had blocked it).
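To illustrate why clamping matters for binary cross-entropy, the sketch below clamps each prediction into `[eps, 1-eps]` before taking logarithms, so a prediction of exactly 0 or 1 yields a large finite loss instead of `+Inf`. The function name and the `eps` value are assumptions for illustration; GoNN's actual clamp bounds are not stated here.

```go
package main

import (
	"fmt"
	"math"
)

// bceLoss returns the mean binary cross-entropy of predictions p against
// targets y. Each p is clamped into [eps, 1-eps] so math.Log never sees 0.
// The eps parameter is an illustrative assumption, not GoNN's constant.
func bceLoss(p, y []float64, eps float64) float64 {
	sum := 0.0
	for i := range p {
		q := math.Min(math.Max(p[i], eps), 1-eps)
		sum -= y[i]*math.Log(q) + (1-y[i])*math.Log(1-q)
	}
	return sum / float64(len(p))
}

func main() {
	// The first two predictions are confidently wrong and would give
	// -log(0) = +Inf without the clamp.
	p := []float64{0, 1, 0.5}
	y := []float64{1, 0, 1}
	fmt.Println(bceLoss(p, y, 1e-12))
}
```

A finite loss keeps the mean over the batch meaningful and avoids propagating `Inf` or `NaN` into the gradients.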
