Nice work!
Have you tried using M-RoPE? It allows using a unified positional encoding space for patches and 1D tokens. From my (small-scale, limited) testing, it makes for very good 1D tokenizers with dynamic resolution and token count, with RoPE's extrapolation ability as a bonus. I've also found RoPE to be more stable under GAN training than learned positional embeddings.
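To make the "unified positional encoding space" idea concrete, here is a minimal NumPy sketch, not taken from any particular codebase. It assumes each token gets a (time, height, width) position triple, that 1D tokens reuse one index on all three axes and start after the largest patch coordinate, and that the rotary frequency pairs are split across the three axes by a hypothetical `sections` argument. The function names are illustrative only.

```python
import numpy as np

def mrope_position_ids(t, h, w, num_latent):
    """Return (t*h*w + num_latent, 3) integer positions ordered (time, height, width)."""
    tt, hh, ww = np.meshgrid(np.arange(t), np.arange(h), np.arange(w), indexing="ij")
    patch_pos = np.stack([tt.ravel(), hh.ravel(), ww.ravel()], axis=-1)
    # Assumption: 1D tokens start after the largest patch coordinate and
    # share the same index on all three axes.
    idx = max(t, h, w) + np.arange(num_latent)
    latent_pos = np.stack([idx, idx, idx], axis=-1)
    return np.concatenate([patch_pos, latent_pos], axis=0)

def apply_mrope(x, pos, sections, base=10000.0):
    """Rotate x (seq, head_dim) using pos (seq, 3).

    sections gives the number of rotary frequency pairs driven by each axis
    and must sum to head_dim // 2.
    """
    half = x.shape[-1] // 2
    assert sum(sections) == half
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    axis_of_pair = np.repeat(np.arange(3), sections)   # which axis drives each pair
    angles = pos[:, axis_of_pair] * inv_freq           # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: a 2x3x4 patch grid followed by 5 one-dimensional tokens.
pos = mrope_position_ids(2, 3, 4, 5)
q = apply_mrope(np.random.default_rng(0).normal(size=(len(pos), 16)), pos, [2, 3, 3])
```

Because the rotation depends only on the position triples, changing the resolution or the number of 1D tokens just changes `pos`; no embedding table has to be resized.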
I have a reference video-tokenizer codebase here that uses sample packing for dynamic-resolution training. It combines M-RoPE and GQA for adaptability and low compute, and is otherwise a mostly standard ViT.
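For anyone unfamiliar with sample packing, a rough sketch of the general idea follows. This is an illustration under my own assumptions, not the code from that repository: variable-length samples are concatenated into one sequence, and a block-diagonal mask keeps attention inside each sample. `pack_samples` is a hypothetical helper name.

```python
import numpy as np

def pack_samples(samples):
    """Pack a list of (len_i, dim) arrays into one sequence.

    Returns the packed tokens, a per-token segment id, and a boolean
    attention mask that is True only for token pairs from the same sample.
    """
    tokens = np.concatenate(samples, axis=0)
    seg = np.concatenate([np.full(len(s), i) for i, s in enumerate(samples)])
    mask = seg[:, None] == seg[None, :]   # block-diagonal (total_len, total_len)
    return tokens, seg, mask

# Example: three samples with different token counts (e.g. different resolutions).
rng = np.random.default_rng(0)
samples = [rng.normal(size=(n, 8)) for n in (6, 12, 4)]
tokens, seg, mask = pack_samples(samples)
```

With positions computed per sample and the mask applied in attention, samples of different resolutions can share a batch without padding.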