MoT or not?

Thanks for the great work. In the paper, you mention using a MoT architecture so that action and video can live in the same space. As I understand it, in a MoT architecture each modality has its own QKV projections, while all modalities share a single attention operation over the combined sequence.
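To make sure we mean the same thing, here is a minimal sketch of what I understand by that description. It is not taken from the paper or the released code: the names `joint_attention`, `init_qkv`, `params_mot`, and `params_shared` are hypothetical, it is single-head, and it uses plain Python lists rather than the actual layers. The only point is the contrast between per-modality QKV weights and one QKV weight set used for every token.

```python
import math
import random


def project(tok, w):
    """Multiply one token vector (length d_in) by a d_in x d_out weight matrix."""
    return [sum(t * wi for t, wi in zip(tok, col)) for col in zip(*w)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def init_qkv(d, rng):
    """Hypothetical helper: one random d x d matrix each for q, k and v."""
    return {
        name: [[rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(d)]
        for name in ("q", "k", "v")
    }


def joint_attention(tokens, modalities, params):
    """Single-head attention over a mixed sequence.

    Each token is projected with the QKV weights of its own modality,
    then one softmax attention runs over all tokens together.
    """
    q, k, v = [], [], []
    for tok, mod in zip(tokens, modalities):
        p = params[mod]  # modality-specific weights are selected here
        q.append(project(tok, p["q"]))
        k.append(project(tok, p["k"]))
        v.append(project(tok, p["v"]))

    d = len(q[0])
    out = []
    for qi in q:
        # Shared attention: every query sees keys from both modalities.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(weights, v)) for c in range(d)])
    return out


rng = random.Random(0)
D = 4

# What I expected from the paper: separate QKV weights per modality.
params_mot = {"video": init_qkv(D, rng), "action": init_qkv(D, rng)}

# What I think I see in the code: the same QKV weights for both modalities.
shared = params_mot["video"]
params_shared = {"video": shared, "action": shared}

tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(3)]
modalities = ["video", "video", "action"]

out_mot = joint_attention(tokens, modalities, params_mot)
out_shared = joint_attention(tokens, modalities, params_shared)
```

In this sketch the two variants differ only in which weights `params[mod]` returns for the action token; the attention step itself is identical.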

But when I check the layers in the released codebase, action and video appear to share the same QKV projections. Am I missing something?