Reducing architecture complexity with M-RoPE

Nice work!
Have you tried [M-RoPE](https://arxiv.org/pdf/2409.12191)? It puts patches and 1D tokens in a single, unified positional-encoding space. In my (small-scale, limited) testing, it makes for very good 1D tokenizers with dynamic resolution and token count, with RoPE's extrapolation ability as a bonus. I've also found RoPE to be more stable under GAN training than learned positional embeddings.
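
To make the "unified positional encoding space" concrete, here is a minimal pure-Python sketch of the idea, not the implementation from the paper or from any particular codebase. It assumes each token gets a (t, h, w) position triple, that 1D tokens repeat one index on all three axes and start after the largest patch index, and that the rotary pairs of a head are split into per-axis sections. The function names, the `sections` split, and the toy sizes are all made up for illustration.

```python
import math

def patch_positions(frames, height, width, start=0):
    """(t, h, w) position ids for a frames x height x width grid of patches."""
    return [(start + t, start + h, start + w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]

def latent_positions(count, start):
    """1D tokens reuse one index on all three axes, so for them the
    rotation below is the same as ordinary 1D RoPE."""
    return [(start + i,) * 3 for i in range(count)]

def mrope_rotate(vec, pos, sections=(2, 3, 3), base=10000.0):
    """Rotate one head vector by its (t, h, w) position.

    `sections` (hypothetical values) is the number of rotary pairs driven
    by each axis; the counts must add up to len(vec) // 2.
    """
    half = len(vec) // 2
    assert sum(sections) == half, "sections must cover every rotary pair"
    # For every rotary pair, which axis supplies the position: 0=t, 1=h, 2=w.
    axis_of_pair = [axis for axis, n in enumerate(sections) for _ in range(n)]
    out = list(vec)
    for i, axis in enumerate(axis_of_pair):
        angle = pos[axis] * base ** (-i / half)  # standard RoPE frequencies
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]  # "rotate half" pairing of dimensions
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Toy sequence: a 2x2x3 patch grid followed by 4 one-dimensional tokens.
patches = patch_positions(frames=2, height=2, width=3)
next_index = max(max(p) for p in patches) + 1  # continue after the largest id
latents = latent_positions(4, start=next_index)
positions = patches + latents
```

Because every token, patch or 1D, goes through the same `mrope_rotate`, no separate positional table is needed per token type, and the attention score between two rotated vectors depends only on the per-axis position differences.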

I have a reference video-tokenizer codebase [here](https://github.com/NilanEkanayake/TiTok-Video) that uses sample packing for dynamic-resolution training. It combines M-RoPE with GQA (grouped-query attention) to get adaptability and low compute together, which leaves the model a mostly standard ViT.
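
For readers unfamiliar with the two terms, the following is a generic pure-Python sketch of sample packing and GQA head sharing. It is not taken from the linked repository: the helper names and the toy sizes are hypothetical, and a real training setup would more likely hand the cumulative lengths to a variable-length attention kernel than build a dense mask.

```python
from itertools import accumulate

def cu_seqlens(lengths):
    """Cumulative sample boundaries in the packed sequence, e.g. [3, 2] -> [0, 3, 5]."""
    return [0] + list(accumulate(lengths))

def block_diagonal_mask(lengths):
    """mask[i][j] is True only when tokens i and j come from the same sample,
    so samples of different resolution can share one sequence without
    attending to each other."""
    sample_id = [s for s, n in enumerate(lengths) for _ in range(n)]
    return [[a == b for b in sample_id] for a in sample_id]

def kv_head_for(query_head, num_q_heads, num_kv_heads):
    """GQA: each group of consecutive query heads shares one key/value head."""
    assert num_q_heads % num_kv_heads == 0, "query heads must split evenly"
    return query_head // (num_q_heads // num_kv_heads)

# Two samples of different token counts packed into one sequence of length 5.
lengths = [3, 2]
bounds = cu_seqlens(lengths)
mask = block_diagonal_mask(lengths)

# 8 query heads sharing 2 key/value heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
kv_heads = [kv_head_for(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
```

Packing removes the padding that fixed-shape batches would need for mixed resolutions, and GQA shrinks the key/value projections, which is where the "adaptability and low compute" pairing comes from.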
