
Implement Galerkin Transformer for Operator Learning #116

@ChrisRackauckas-Claude

Summary

Implement the Galerkin Transformer, which replaces softmax attention with linear attention inspired by Petrov-Galerkin projection for PDE operator learning.

Reference

  • Cao, "Choose a Transformer: Fourier or Galerkin," NeurIPS 2021. arXiv:2105.14995

Description

The Galerkin Transformer removes softmax normalization from attention and uses Q(K^T V) (Galerkin-type) or (QK^T)V (Fourier-type) attention; the former mimics a Petrov-Galerkin projection as used in finite element methods, the latter a Fourier integral operator. Cao reports significant improvements in training cost and accuracy over softmax-normalized counterparts on operator learning tasks.
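The two attention variants can be sketched in a few lines of NumPy. Function names and shapes here are illustrative, not the final API; following Cao (2021), softmax is replaced by layer normalization applied to K and V (Galerkin-type) or to Q and K (Fourier-type):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-feature layer normalization (learnable scale/shift omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def galerkin_attention(Q, K, V):
    """Galerkin-type attention: Q (K^T V) / n, with layer norm on K and V
    in place of softmax. K^T V is d x d, so the n x n matrix is never formed."""
    n = Q.shape[0]
    return Q @ (layer_norm(K).T @ layer_norm(V)) / n

def fourier_attention(Q, K, V):
    """Fourier-type attention: (Q K^T) V / n, with layer norm on Q and K."""
    n = Q.shape[0]
    return (layer_norm(Q) @ layer_norm(K).T) @ V / n
```

Without the normalizations, the two orderings are the same product by associativity; the choice of grouping changes only the cost and the interpretation.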

Key features:

  • Linear (softmax-free) attention; the Galerkin-type ordering is O(n) in sequence length
  • Galerkin-type: Q(K^T V) — analogous to a Petrov-Galerkin projection
  • Fourier-type: (QK^T)V — analogous to a Fourier integral operator
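The O(n) claim for the Galerkin-type ordering follows from associativity alone: computing K^T V first produces a d x d matrix and never forms the n x n attention matrix. A small FLOP-counting sketch (n and d values are illustrative):

```python
def matmul_flops(m, k, p):
    """Cost of an (m x k) @ (k x p) matrix product, counting multiply-add as 2 FLOPs."""
    return 2 * m * k * p

n, d = 4096, 64  # sequence length, head dimension (illustrative values)

# Galerkin-type Q (K^T V): a (d x n)@(n x d) product, then (n x d)@(d x d).
galerkin = matmul_flops(d, n, d) + matmul_flops(n, d, d)  # = 4 * n * d**2

# Fourier-type (Q K^T) V: an (n x d)@(d x n) product, then (n x n)@(n x d).
fourier = matmul_flops(n, d, n) + matmul_flops(n, n, d)   # = 4 * d * n**2

print(fourier // galerkin)  # ratio is n/d, i.e. 64 here
```

So for long sequences (n >> d) the Galerkin-type grouping is cheaper by a factor of n/d, which is the source of the training-cost improvements the paper reports.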
