Model/Pipeline/Scheduler description
Model description
JoyAI-Image, from JD.com, is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B
Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT).
JoyAI-Image Edit Plus is the multi-image instruction-guided editing variant. Unlike the single-image JoyAI-Image Edit (added in #XXXX), Edit Plus accepts 1–6 reference images and a
text instruction to generate a new image that combines elements from the references.
Key architectural differences from Edit:
- Patchified 6D latent representation: Input images and target noise are independently VAE-encoded and patchified into [B, max_patches, C, pt, ph, pw] format with a target_mask to
distinguish target noise from reference patches.
- Variable reference images: Supports 1–6 reference images per sample via dynamic shape_list.
- Batched RoPE: Per-component rotary position embeddings with temporal offsets for each reference image and the target.
- Norm-guided CFG: Classifier-free guidance with norm rescaling in a single forward pass.
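To make the patchified layout concrete, here is a minimal NumPy sketch of how one sample's reference latents and target noise could be packed into a padded [max_patches, C, pt, ph, pw] array with a target_mask and shape_list. It is an illustration under stated assumptions, not the actual JoyAI-Image code: the helper names patchify and pack_sample, the patch size (1, 2, 2), the [C, T, H, W] latent layout, the zero padding, and the ordering (references first, target last) are all hypothetical. Stacking such samples along a leading axis would give the [B, max_patches, C, pt, ph, pw] format described above.

```python
import numpy as np


def patchify(latent, pt, ph, pw):
    """Split a [C, T, H, W] latent into patches of shape [N, C, pt, ph, pw]."""
    C, T, H, W = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    gt, gh, gw = T // pt, H // ph, W // pw
    x = latent.reshape(C, gt, pt, gh, ph, gw, pw)
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)  # [gt, gh, gw, C, pt, ph, pw]
    return x.reshape(gt * gh * gw, C, pt, ph, pw)


def pack_sample(reference_latents, target_noise, max_patches, patch=(1, 2, 2)):
    """Pack reference patches followed by target patches, zero-padded to max_patches.

    Hypothetical helper: ordering and padding scheme are assumptions.
    """
    target = patchify(target_noise, *patch)
    parts = [patchify(r, *patch) for r in reference_latents] + [target]
    patches = np.concatenate(parts, axis=0)
    n = patches.shape[0]
    if n > max_patches:
        raise ValueError(f"{n} patches exceed max_patches={max_patches}")

    packed = np.zeros((max_patches,) + patches.shape[1:], dtype=patches.dtype)
    packed[:n] = patches

    # True only at positions holding target-noise patches; False for
    # reference patches and for padding.
    target_mask = np.zeros(max_patches, dtype=bool)
    target_mask[n - target.shape[0]:n] = True

    # Per-component (T, H, W) latent sizes, so the variable number of
    # references can be recovered later.
    shape_list = [r.shape[1:] for r in reference_latents] + [target_noise.shape[1:]]
    return packed, target_mask, shape_list


# Usage: two references of different sizes plus one target.
rng = np.random.default_rng(0)
refs = [rng.standard_normal((4, 1, 4, 4)), rng.standard_normal((4, 1, 4, 6))]
noise = rng.standard_normal((4, 1, 4, 4))
packed, target_mask, shape_list = pack_sample(refs, noise, max_patches=16)
# packed.shape == (16, 4, 1, 2, 2); 4 + 6 reference patches, 4 target patches, 2 padding slots
```

Because every sample is padded to the same max_patches, samples with different numbers of references can share a batch, and target_mask lets the model or loss select only the target positions.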
Open source status
Provide useful links for the implementation