GoNN

A pure-Go deep learning and machine learning framework with PyTorch-style autograd, neural-network layers, optimizers, classical ML algorithms, and an optional CUDA backend.

What is GoNN?

GoNN is a single-binary, dependency-light alternative to PyTorch / tinygrad / TensorFlow, written in Go. It provides:

Autograd Tensor — flat-buffer *tensor.Tensor with shape + strides + automatic differentiation by reverse-mode graph traversal.
nn package —
- Linear & conv: Linear, Conv1d, Conv2d, Conv3d, ConvTranspose1d/2d/3d, Embedding.
- Pooling: MaxPool2d, AvgPool2d, AdaptiveMaxPool1d/2d/3d, AdaptiveAvgPool1d/2d/3d.
- Normalization: BatchNorm1d, BatchNorm2d, LayerNorm, GroupNorm, RMSNorm, InstanceNorm1d/2d.
- Padding & upsample: ZeroPad2d, ConstantPad2d, ReflectionPad2d, ReplicationPad2d, Upsample, PixelShuffle/PixelUnshuffle.
- Recurrent: RNN, LSTM, GRU (single layer), RNNCell, LSTMCell, GRUCell, MultiLayerRNN, MultiLayerLSTM, MultiLayerGRU (multi-layer + bidirectional), Seq2Seq.
- Attention/Transformer: MultiHeadAttention (optional causal mask), TransformerEncoderLayer/TransformerEncoder, TransformerDecoderLayer/TransformerDecoder.
- Containers: Sequential, Dropout.
- Parametric/gated activations as modules: PReLU (learnable slope), GLU.
optim package — SGD (momentum/Nesterov), Adam, AdamW, RMSprop, Adagrad, Adadelta, NAdam, Adamax, RAdam, LBFGS (closure-style), Rprop, plus LR schedulers (StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, LinearLR, PolynomialLR, ChainedScheduler, SequentialLR, CyclicLR, ReduceLROnPlateau, OneCycleLR).
ml package — classical algorithms:
- Linear: LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, LogisticRegression.
- Discriminant: LinearDiscriminantAnalysis (LDA, with Fisher transform), QuadraticDiscriminantAnalysis (QDA).
- Trees & ensembles: DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor, ExtraTreesClassifier/Regressor, AdaBoostClassifier, GradientBoostingClassifier/Regressor, IsolationForest.
- Neighbors: KNNClassifier, KNNRegressor.
- SVM: LinearSVC.
- Naive Bayes: GaussianNB, MultinomialNB, BernoulliNB.
- Clustering: KMeans, DBSCAN, AgglomerativeClustering, MeanShift, GaussianMixture.
- Dim. reduction: PCA, KernelPCA, FastICA, TSNE.
- Preprocessing: StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder, PolynomialFeatures.
- Metrics + model selection: Accuracy, Precision/Recall/F1, ConfusionMatrix, MSE/MAE/R²/SilhouetteScore/ROCAUC, TrainTestSplit, KFold, CrossValScore.
data package — Dataset, DataLoader, transforms, MNIST/CSV loaders, synthetic dataset generators (MakeRegression, MakeClassification, MakeBlobs, MakeMoons).
backend package — pluggable compute backend with three implementations: CPU (gonum BLAS), CUDA (-tags cuda, CGO + cuBLAS, wired into tensor.MatMul so real models run GEMMs on the GPU; verified on an RTX 3060 and benchmarked vs PyTorch/TensorFlow/tinygrad — see Benchmarks), and OpenCL (-tags opencl, CGO + fp64 kernels; numerically verified against the CPU backend). One backend.Backend interface; callers don't change.
Tensor dtypes — Float64 (default), Float32, Float16 with correct IEEE-754 precision/range semantics (x.To(tensor.Float16), numpy-style type promotion). Emulated on float64 storage for correct mixed-precision numerics; for real fp16 storage + tensor-core compute, use the GPU DeviceBufferF16 path (fp16 GEMM at ~24 TFLOP/s on a 3060).
Fused CUDA flash-attention (forward + backward) — a custom fp64 flash-attention kernel (online softmax, no S×S materialization) wired into nn.MultiHeadAttention. On causal fp64 attention it is ~1.4–1.5× faster than PyTorch's SDPA, and its fused backward (gradcheck ≈ 5e-8) lets the differentiable Forward train on the kernel — not just run inference.

Everything compiles to a single static Go binary (no Python runtime).

Install

go get github.com/Shivangx01b/GoNN

Quickstart — Tensor + Autograd

package main

import (
    "fmt"
    "gonn/tensor"
)

func main() {
    x := tensor.New([]float64{1, 2, 3}, 3, 1).SetRequiresGrad(true)
    W := tensor.New([]float64{2, -1, 0.5}, 1, 3).SetRequiresGrad(true)

    y := W.MatMul(x).Square().Sum()
    y.Backward()

    fmt.Println("y      =", y)
    fmt.Println("dy/dx  =", x.Grad)  // [6, -3, 1.5]
    fmt.Println("dy/dW  =", W.Grad)  // [3, 6, 9]
}

Quickstart — Linear Regression with SGD

See examples/regression:

W := tensor.Randn(1, 1).SetRequiresGrad(true)
b := tensor.Zeros(1).SetRequiresGrad(true)
opt := optim.NewSGD([]*tensor.Tensor{W, b}, 0.01)

for epoch := 0; epoch < 200; epoch++ {
    opt.ZeroGrad()
    pred := X.MatMul(W).Add(b)
    loss := pred.Sub(Y).Square().Mean()
    loss.Backward()
    opt.Step()
}

Quickstart — MLP Classifier (PyTorch-style)

import (
    "gonn/nn"
    "gonn/optim"
    "gonn/tensor"
)

model := nn.Sequential(
    nn.NewLinear(784, 256, true),
    nn.ReLU{},
    nn.NewLinear(256, 64, true),
    nn.ReLU{},
    nn.NewLinear(64, 10, true),
)

opt := optim.NewAdam(model.Parameters(), 1e-3)

for epoch := 0; epoch < 10; epoch++ {
    for batch := range loader.Iter() {
        opt.ZeroGrad()
        logits := model.Forward(batch.X)
        loss := nn.CrossEntropyLoss(logits, batch.Y)
        loss.Backward()
        opt.Step()
    }
}

Quickstart — Classical ML

import "gonn/ml"

// K-means clustering
km := ml.NewKMeans(3, 100, 1e-4)
km.Fit(X)
labels := km.Predict(X)

// Random forest classification
rf := ml.NewRandomForestClassifier(100, 10, 0)
rf.Fit(Xtr, ytr)
yhat := rf.Predict(Xte)

// Gradient boosting regression
gb := ml.NewGradientBoostingRegressor(100, 0.1, 3)
gb.Fit(Xtr, ytr)

Tensor reference

Category	Methods
Construct	`New`, `Zeros`, `Ones`, `Full`, `Randn`, `Uniform`, `Arange`, `Eye`, `Scalar`
Arithmetic	`Add`, `Sub`, `Mul`, `Div`, `MatMul`, `Neg`, scalar variants `AddScalar`, …
Unary	`Exp`, `Log`, `Sqrt`, `Sin`, `Cos`, `Tan`, `Abs`, `Reciprocal`, `Pow`, `Square`, `Clip`
Reduction	`Sum`, `Mean`, `Max`, `Min`, `SumAxis`, `MeanAxis`, `MaxAxis`, `MinAxis`, `ArgMax`, `ArgMin`
Shape	`Reshape`, `View`, `Flatten`, `Transpose`, `T`, `Permute`, `Squeeze`, `Unsqueeze`, `Expand`, `Concat`, `Stack`
Activation	`ReLU`, `LeakyReLU`, `ELU`, `SELU`, `CELU`, `Sigmoid`, `Tanh`, `LogSigmoid`, `HardTanh`, `HardSigmoid`, `Softplus`, `Softsign`, `GELU`, `SiLU` (Swish), `HardSwish`, `Mish`, `ReLU6`, `Hardshrink`, `Softshrink`, `Tanhshrink`, `Threshold`, `RReLU`, `Softmax`, `LogSoftmax`
Autograd	`SetRequiresGrad`, `Backward`, `ZeroGrad`, `.Grad`

CUDA backend

The default build is pure-Go CPU. To compile against CUDA:

# 1. Build the native library
cd backend/cuda
nvcc -O3 -Xcompiler -fPIC -shared gonn_cuda.cu -o libgonn_cuda.so -lcublas

# 2. Build GoNN with the cuda tag
go build -tags cuda ./...

The CUDA implementation lives in backend/cuda/gonn_cuda.cu. The Go side calls into it via CGO and uses cuBLAS for matmul. The CPU and CUDA backends share the same backend.Backend interface so callers do not need to change.

The GPU backend currently accelerates:

Category	Ops
MatMul	`MatMul` (cuBLAS `Dgemm`/`Sgemm`) — dispatched from `tensor.MatMul`
Elementwise	`AddElem`, `MulElem`, `Sub`, `Div`, `Scale`, `AxpyInto` (in-place `out += alpha*x`)
Reductions	`Sum`, `Max` (single-block tree reduce in shared memory)
Activations	`ReLU`, `Sigmoid`, `Tanh`, `Exp`, `Log`, `GELU` (tanh approx.), `SiLU` (Swish)
Attention	Fused fp64 flash-attention forward (`flash_attn_f64_tiled`, online softmax)

The CUDA backend is verified for correctness against the CPU backend on the GPU (matmul maxAbsDiff ≈ 7e-16). The tensor-op path copies host↔device per call (no device buffer caching yet); the device-resident benchmark path keeps inputs on the GPU and is the apples-to-apples comparison against other frameworks.

Reproducible GPU build (Docker, no local CUDA toolkit needed)

docker build -f benchmark/docker/Dockerfile.cuda -t gonn-cuda .
docker run --rm --gpus all -v "$PWD":/work -w /work gonn-cuda \
    bash benchmark/docker/build_and_run.sh

This compiles gonn_cuda.cu with nvcc, builds GoNN -tags cuda, verifies correctness on the GPU, then runs the matmul, elementwise, flash-attention, and MultiHeadAttention.ForwardFused benchmarks.

OpenCL backend (`-tags opencl`)

The OpenCL backend (backend/opencl, fp64 kernels mirroring the CUDA ones) is numerically verified against the CPU backend:

docker run --rm --gpus all -v "$PWD":/work -w /work gonn-cuda \
    bash benchmark/docker/opencl_run.sh

It runs on any OpenCL device. The verification uses the portable oclgrind fp64 runtime because this machine's Docker/WSL2 GPU passthrough does not inject NVIDIA's OpenCL driver; the same binary runs on the GPU wherever a real GPU OpenCL ICD is present (native Linux + NVIDIA driver, or a Windows host).

Benchmarks

Run live on one machine (12-core CPU, RTX 3060) vs PyTorch 2.7.1+cu128, TensorFlow 2.20, tinygrad 0.13, with matched CUDA-event timing. Full methodology and tables: benchmark/REPORT.md and benchmark/RESULTS.md. Honest highlights:

op (N=2048 / shape)	GoNN	PyTorch	tinygrad	verdict
causal attention f64 (GFLOP/s)	~87–96	58–66	—	GoNN wins ~1.4–1.5× (fused kernel)
matmul f32 GPU (GFLOP/s)	~7,700–8,150	~7,600	1,104 (OpenCL)	GoNN ≈ PyTorch; ~7× tinygrad
matmul f64 GPU (GFLOP/s)	~170–179	~174	176	three-way tie (all cuBLAS)
fused attention in `nn.MultiHeadAttention`	trains (gradcheck ≈5e-8) + inference on GPU	—	—	fwd+bwd kernel
matmul f64 CPU (GFLOP/s)	40 (gonum)	166 (MKL)	—	PyTorch (MKL)

Honest summary: GoNN's GPU matmul is on par with PyTorch (both lean on cuBLAS), it beats PyTorch ~1.5× on causal fp64 attention via a custom fused kernel, and its CPU matmul is 17× faster than the old naive loop but still behind MKL (pure Go has no SIMD intrinsics). GoNN is not "faster than PyTorch everywhere" — see the report for where it wins, ties, and loses.

Project layout

GoNN/
├── tensor/        # Core Tensor + autograd
├── nn/            # Layers, losses, init, activations as modules
├── optim/         # Optimizers + LR schedulers
├── ml/            # Classical ML algorithms
├── data/          # Datasets, DataLoader, transforms
├── backend/       # CPU (gonum BLAS) / CUDA backend contract
│   └── cuda/      # CUDA kernels + fused flash-attention (build tag `cuda`)
├── benchmark/     # Cross-framework benchmarks + Docker GPU build + report
├── examples/      # Runnable demos
└── main.go        # Top-level smoke test

Loss reference

MSELoss, MAELoss / L1Loss, SmoothL1Loss, HuberLoss, CrossEntropyLoss, NLLLoss, BCELoss, BCEWithLogitsLoss, KLDivLoss, PoissonNLLLoss, GaussianNLLLoss, MarginRankingLoss, HingeEmbeddingLoss, CosineEmbeddingLoss, TripletMarginLoss, MultiMarginLoss.

Examples

Runnable demos under examples/:

examples/regression — linear regression with SGD.
examples/mlp — 3-class MLP classifier with Adam (reaches 100%).
examples/cnn — Conv2d + MaxPool2d + AdaptiveAvgPool2d image classifier.
examples/transformer — small transformer encoder + classification head.
examples/ml_classical — LinearRegression + KMeans + PCA.

Status

The tensor + autograd core, the full NN layer catalogue (linear, conv 1/2/3-d, conv-transpose, pooling/adaptive pooling, normalization, padding, upsample, RNN/LSTM/GRU with cells and bidirectional/multi-layer variants, Seq2Seq, attention, transformer encoder/decoder), all common optimizers and schedulers, and the classical ML catalogue (linear, discriminant, trees, ensembles, boosting, isolation forest, SVM, NB, KNN, clustering, dimensionality reduction, preprocessing, metrics) are implemented and tested. The compute backend is pluggable: CPU (gonum BLAS) by default, CUDA (cuBLAS + custom kernels, incl. a fused fp64 flash-attention) via -tags cuda, verified on GPU and benchmarked against PyTorch/TensorFlow/tinygrad. Coverage of more exotic corners (sparse ops, distributed training, JIT, CTC) is intentionally not pursued.

Recent correctness fixes: Concat/Stack are now autograd-aware (previously dropped gradients), BCELoss clamps to avoid log(0), BatchNorm tracks the unbiased running variance (PyTorch parity), and the CUDA backend now compiles (a constant -1.0/0.0 had blocked it).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GoNN

What is GoNN?

Install

Quickstart — Tensor + Autograd

Quickstart — Linear Regression with SGD

Quickstart — MLP Classifier (PyTorch-style)

Quickstart — Classical ML

Tensor reference

CUDA backend

Reproducible GPU build (Docker, no local CUDA toolkit needed)

OpenCL backend (`-tags opencl`)

Benchmarks

Project layout

Loss reference

Examples

Status

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
backend		backend
benchmark		benchmark
data		data
examples		examples
ml		ml
nn		nn
optim		optim
static		static
tensor		tensor
.gitignore		.gitignore
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go

Folders and files

Latest commit

History

Repository files navigation

GoNN

What is GoNN?

Install

Quickstart — Tensor + Autograd

Quickstart — Linear Regression with SGD

Quickstart — MLP Classifier (PyTorch-style)

Quickstart — Classical ML

Tensor reference

CUDA backend

Reproducible GPU build (Docker, no local CUDA toolkit needed)

OpenCL backend (-tags opencl)

Benchmarks

Project layout

Loss reference

Examples

Status

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

OpenCL backend (`-tags opencl`)

Packages