What model architectures have you used to encode text and images?