Documentation: https://frabazz.github.io/lbfgs-FFNN/
This project implements advanced Quasi-Newton optimization methods, specifically L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) and its stochastic variant S-LBFGS, designed for large-scale minimization problems.
The complete project report is available in report/Report.pdf; for theoretical details, implementation choices and experimental results please refer to this document.
L-BFGS is a Quasi-Newton method that stores a limited history of past parameter and gradient updates to approximate the inverse Hessian implicitly:

- Mechanism: It stores pairs of vectors $(s_k, y_k)$, where $s_k = w_{k+1} - w_k$ and $y_k = \nabla F(w_{k+1}) - \nabla F(w_k)$.
- Two-Loop Recursion: Efficiently computes the search direction $p_k = -H_k^{-1} \nabla F(w_k)$ without ever forming the matrix $H_k^{-1}$.
- Suitability: Valid for smooth, deterministic problems (e.g., standard benchmarks such as the Rosenbrock function).
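To make the mechanics concrete, here is a minimal sketch of the two-loop recursion using Eigen. It is illustrative only: the function and variable names are ours, not the actual API of src/minimizer/lbfgs.hpp.

```cpp
// Sketch of the L-BFGS two-loop recursion with Eigen; names are illustrative.
#include <Eigen/Dense>
#include <cstddef>
#include <deque>
#include <vector>

using Vec = Eigen::VectorXd;

// Given the m most recent pairs (s_i, y_i) and the current gradient,
// return the direction p_k = -H_k^{-1} grad without forming any matrix.
Vec two_loop_direction(const std::deque<Vec>& s, const std::deque<Vec>& y,
                       const Vec& grad) {
    const std::size_t m = s.size();
    Vec q = grad;
    std::vector<double> alpha(m), rho(m);

    // First loop: newest to oldest pair.
    for (std::size_t i = m; i-- > 0;) {
        rho[i]   = 1.0 / y[i].dot(s[i]);
        alpha[i] = rho[i] * s[i].dot(q);
        q -= alpha[i] * y[i];
    }

    // Scale by gamma_k = s^T y / y^T y, the usual seed for the inverse Hessian.
    const double gamma =
        m > 0 ? s.back().dot(y.back()) / y.back().dot(y.back()) : 1.0;
    Vec r = gamma * q;

    // Second loop: oldest to newest pair.
    for (std::size_t i = 0; i < m; ++i) {
        const double beta = rho[i] * y[i].dot(r);
        r += (alpha[i] - beta) * s[i];
    }
    return -r;  // descent direction p_k
}
```

With a history of $m$ pairs, each call costs $O(mn)$ vector operations instead of the $O(n^2)$ of a dense Hessian-vector product.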
A standard stochastic gradient descent implementation is provided as a baseline for comparison.
- Mechanism: Updates the weights using a gradient estimated from a small mini-batch of data samples: $w_{k+1} = w_k - \eta \nabla f_{S_k}(w_k)$.
- Suitability: Simple and widely used for training neural networks and large-scale machine learning models, though it may require careful tuning of the learning rate $\eta$.
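For reference, a minimal sketch of such a training loop, assuming a user-supplied grad(w, batch) callable that returns the averaged mini-batch gradient; all names are illustrative, not the API of src/minimizer/s_gd.hpp.

```cpp
// Minimal mini-batch SGD loop sketch; grad(w, batch) is an assumed callable.
#include <Eigen/Dense>
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

template <typename GradFn>
void sgd(Eigen::VectorXd& w, GradFn grad, std::size_t n_samples,
         std::size_t batch_size, double eta, int epochs) {
    std::vector<int> idx(n_samples);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n_samples-1
    std::mt19937 rng{42};
    for (int e = 0; e < epochs; ++e) {
        std::shuffle(idx.begin(), idx.end(), rng);  // reshuffle every epoch
        for (std::size_t i = 0; i < n_samples; i += batch_size) {
            const std::size_t end = std::min(i + batch_size, n_samples);
            std::vector<int> batch(idx.begin() + i, idx.begin() + end);
            w -= eta * grad(w, batch);  // w_{k+1} = w_k - eta * grad_{S_k}(w_k)
        }
    }
}
```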
The S-LBFGS implementation follows the algorithm proposed by Moritz et al. (2016). It effectively integrates curvature information into stochastic optimization using a stable Hessian update and variance reduction.
- Variance Reduction (SVRG framework): To control the noise in the gradient approximation, we use a semi-stochastic gradient:
  - Every $m$ iterations (an epoch), we compute a full gradient $\mu = \nabla F(\tilde{w})$ at a reference point $\tilde{w}$.
  - During the inner loop, we update each mini-batch gradient with the reference correction: $g_k = \nabla f_{S_k}(w_k) - \nabla f_{S_k}(\tilde{w}) + \mu$ (see the sketch below).

This variance-reduced gradient $g_k$ remains unbiased and becomes increasingly accurate as $w_k$ approaches the reference point $\tilde{w}$.
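A minimal sketch of this semi-stochastic gradient, again assuming a grad(w, batch) callable; the names are ours, not the project's API.

```cpp
// SVRG-style semi-stochastic gradient sketch; grad(w, batch) is assumed.
#include <Eigen/Dense>
#include <vector>

template <typename GradFn>
Eigen::VectorXd svrg_gradient(GradFn grad,
                              const Eigen::VectorXd& w,       // current iterate w_k
                              const Eigen::VectorXd& w_ref,   // reference point  ~w
                              const Eigen::VectorXd& mu,      // full gradient at ~w
                              const std::vector<int>& batch)  // mini-batch S_k
{
    // g_k = grad_{S_k}(w_k) - grad_{S_k}(~w) + mu
    return grad(w, batch) - grad(w_ref, batch) + mu;
}
```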
- Stable Hessian Update: Standard BFGS updates are unstable with noisy stochastic gradients, so S-LBFGS decouples the Hessian update from the step update:
  - Every $L$ iterations, we compute a stable curvature pair $(s, y)$ using a separate mini-batch $b_H$.
  - $s = \bar{w}_t - \bar{w}_{t-1}$ (difference of averaged iterates).
  - $y = \nabla^2 F(\bar{w}_t)(\bar{w}_t - \bar{w}_{t-1})$ is approximated accurately, often using Hessian-Vector Products (HVP).
- Hessian-Vector Products (HVP): The curvature vector $y$ is computed using finite differences (or automatic differentiation) on a mini-batch, for instance $y \approx \left(\nabla f_{b_H}(\bar{w} + \epsilon s) - \nabla f_{b_H}(\bar{w})\right)/\epsilon$ for a small step $\epsilon$. This avoids forming the Hessian matrix while capturing the curvature in the direction $s$ (see the HVP sketch after this list).
- Parallelization: The code is parallelized with OpenMP. The costly gradient computations and finite-difference HVPs are parallelized across data points (see the OpenMP sketch below).
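As referenced in the HVP bullet above, here is a minimal finite-difference sketch; the step size eps and all names are assumptions for illustration, not the project's actual implementation.

```cpp
// Finite-difference Hessian-vector product on a mini-batch; grad is assumed.
#include <Eigen/Dense>
#include <vector>

template <typename GradFn>
Eigen::VectorXd hvp_finite_diff(GradFn grad, const Eigen::VectorXd& w,
                                const Eigen::VectorXd& s,         // direction s
                                const std::vector<int>& batch_H,  // mini-batch b_H
                                double eps = 1e-6) {
    // y ~= (grad(w + eps*s) - grad(w)) / eps  (forward difference along s)
    return (grad(w + eps * s, batch_H) - grad(w, batch_H)) / eps;
}
```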
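And a sketch of the OpenMP accumulation pattern described in the Parallelization bullet, merging thread-private buffers at the end; per_sample_grad is a hypothetical helper.

```cpp
// OpenMP-parallel averaging of per-sample gradients; names are illustrative.
#include <Eigen/Dense>
#include <vector>

template <typename SampleGradFn>
Eigen::VectorXd batch_gradient(SampleGradFn per_sample_grad,
                               const Eigen::VectorXd& w,
                               const std::vector<int>& batch) {
    Eigen::VectorXd g = Eigen::VectorXd::Zero(w.size());
    #pragma omp parallel
    {
        // Each thread accumulates into a private buffer to avoid contention.
        Eigen::VectorXd local = Eigen::VectorXd::Zero(w.size());
        #pragma omp for nowait
        for (int i = 0; i < static_cast<int>(batch.size()); ++i)
            local += per_sample_grad(w, batch[i]);
        #pragma omp critical
        g += local;  // merge thread-local sums
    }
    return g / static_cast<double>(batch.size());
}
```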
The codebase is split into two concrete backends that share the same high-level flow (define a network, compute loss/gradients, and run an optimizer), but use different data structures and kernels:
CPU backend:

- Core optimizer API: `src/minimizer/` hosts shared minimizer utilities and interfaces (full-batch and stochastic base classes, ring buffer).
- Optimizers: `src/minimizer/lbfgs.hpp`, `src/minimizer/bfgs.hpp`, `src/minimizer/gd.hpp`, `src/minimizer/s_gd.hpp`, `src/minimizer/s_lbfgs.hpp`, `src/minimizer/newton.hpp`.
- Network stack: `src/network.hpp` and `src/layer.hpp` implement a dense MLP with flat parameter storage and Eigen-based forward/backward passes.
- Unified training flow: `src/unified_optimization.hpp`, `src/unified_launcher.hpp`, and `src/network_wrapper.hpp` provide backend configuration, dataset setup, and optimizer strategies.

CUDA backend:

- CUDA primitives: `src/cuda/device_buffer.cuh` wraps device memory, `src/cuda/cublas_handle.cuh` manages cuBLAS, and `src/cuda/kernels.cuh` hosts custom kernels (activation, loss, etc.).
- Network stack: `src/cuda/network.cuh` and `src/cuda/layer.cuh` implement a GPU MLP. Parameters and gradients live in contiguous device buffers; per-layer activations/deltas are allocated per batch; GEMMs use cuBLAS.
- Optimizers: `src/cuda/minimizer_base.cuh` defines a CUDA optimizer interface. `src/cuda/gd.cuh`, `src/cuda/sgd.cuh`, and `src/cuda/lbfgs.cuh` implement the training steps, with optional history tracking via `src/iteration_recorder.hpp`.
- Unified training flow: uses `src/unified_optimization.hpp`, `src/unified_launcher.hpp`, and `src/network_wrapper.hpp` (CUDA specializations compiled under `__CUDACC__`).
The directory structure is as follows:
```
./amsc
├── build/                      # Build artifacts
├── src/                        # Source code
│   ├── minimizer/              # CPU optimizers + base interfaces
│   ├── cuda/                   # CUDA backend (network + optimizers + kernels)
│   ├── enzyme/                 # Enzyme PINN experiments
│   ├── network.hpp             # CPU MLP utilities (Eigen)
│   ├── layer.hpp               # CPU layers/activations
│   ├── network_wrapper.hpp     # CPU/CUDA wrapper
│   ├── unified_optimization.hpp
│   └── unified_launcher.hpp
├── tests/
│   ├── mnist/                  # CPU/GPU MNIST runners
│   ├── fashion-mnist/          # CPU/GPU Fashion-MNIST runners
│   ├── burgers/                # PDE/PINN tests
│   └── pytorch/                # PyTorch reference scripts
└── ...
```
We use CMake for build configuration.
If you prefer a Docker setup, we provide a Docker image that installs all dependencies and lets you build and run inside the container. See enviroment/README.md.
- CMake >= 3.18
- A C++20 compiler
- Eigen3 (required)
- OpenMP (optional, enables Eigen multithreading)
- CUDA toolkit + cuBLAS (optional, for GPU targets)
CPU-only build:

```sh
mkdir -p build
cd build
cmake .. -DENABLE_CUDA=OFF
make
```

CUDA-enabled build:

```sh
mkdir -p build
cd build
cmake .. -DENABLE_CUDA=ON
make
```

Note: if CUDA is enabled but no CUDA compiler is found, the CUDA targets are skipped.
CPU test runners:

```sh
./build/test_runner
./build/test_runner_autodiff
./build/test_mnist_cpu
./build/test_fashion_cpu
```

GPU test runners:

```sh
./build/test_mnist_gpu
./build/test_fashion_gpu
./build/test_fashion_gpu_deep
```

A standalone CMake project is available under tests/burgers to keep the main project build unchanged. This target requires Clang and the Enzyme plugin.
Before building the Burgers tests, make sure Enzyme is compiled, either by following the official guide (https://enzyme.mit.edu/Installation/) or by building the copy under lib/Enzyme/enzyme:

```sh
mkdir build && cd build
cmake -G Ninja .. -DLLVM_DIR=/path/to/llvm/lib/cmake/llvm
ninja
ninja check-enzyme
```

We tested on clang++-15, but this should work with Clang 11-16.
Configure and build the Burgers tests with the Enzyme plugin:

```sh
cmake -S tests/burgers -B build-burgers \
  -DCMAKE_CXX_COMPILER=clang++-19 \
  -DENZYME_PLUGIN_PATH=/path/to/ClangEnzyme-19.so
cmake --build build-burgers -j
```

Then run the parallel Burgers test:

```sh
./build-burgers/test_burgers_parallel
```