Summary
Implement the Galerkin Transformer, which replaces softmax attention with linear attention inspired by Petrov-Galerkin projection for PDE operator learning.
Reference
- Cao, "Choose a Transformer: Fourier or Galerkin," NeurIPS 2021. arXiv:2105.14995
Description
The Galerkin Transformer removes softmax normalization from attention and instead computes Q(K^T V) (Galerkin-type) or (QK^T)V (Fourier-type), applying layer normalization to the key/value matrices (Galerkin) or query/key matrices (Fourier). This mimics a Petrov-Galerkin projection in finite element methods and yields significant improvements in training cost and accuracy over softmax-normalized counterparts on operator learning tasks.
Key features:
- Softmax-free linear attention; the Galerkin-type variant costs O(n d^2) rather than O(n^2 d) in sequence length n
- Galerkin-type: Q(K^T V) — analogous to a Petrov-Galerkin projection
- Fourier-type: (QK^T)V — analogous to a Fourier integral operator
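A minimal NumPy sketch of the two attention types described above, for single-head, unbatched inputs. Function names are illustrative, not from the paper's code; the per-matrix layer normalization used in the actual model is omitted for brevity, and the 1/n scaling follows the paper's softmax-free formulation. Without the layer norms, the two variants are algebraically identical by associativity of matrix multiplication — the difference is purely computational: the Galerkin grouping never forms the n×n matrix QK^T.

```python
import numpy as np

def galerkin_attention(Q, K, V):
    """Galerkin-type attention: Q (K^T V) / n.

    K^T V is (d x d), so the cost is O(n d^2) -- linear in
    sequence length n; the n x n score matrix is never formed.
    """
    n = Q.shape[0]
    return Q @ (K.T @ V) / n

def fourier_attention(Q, K, V):
    """Fourier-type attention: (Q K^T) V / n.

    Q K^T is (n x n), so the cost is O(n^2 d) -- quadratic in
    sequence length, like standard attention but without softmax.
    """
    n = Q.shape[0]
    return (Q @ K.T) @ V / n
```

Because no softmax sits between the products, `galerkin_attention` and `fourier_attention` return the same values (up to floating-point error) in this normalization-free sketch; in the full model the differing layer-norm placement breaks this equivalence.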