Add JoyAI-Image Edit Plus pipeline and model #14049

Description

@tangyanf

Model/Pipeline/Scheduler description

Model description

JoyAI-Image is a unified multimodal foundation model from JD.com for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B
Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT).

JoyAI-Image Edit Plus is the multi-image instruction-guided editing variant. Unlike the single-image JoyAI-Image Edit (added in #XXXX), Edit Plus takes 1–6 reference images plus a
text instruction and generates a new image that combines elements from the references.

Key architectural differences from Edit:

  • Patchified 6D latent representation: Each reference image is VAE-encoded on its own; the reference latents and the target noise are then patchified into a
    [B, max_patches, C, pt, ph, pw] tensor, with a target_mask that distinguishes target-noise patches from reference patches.
  • Variable reference images: Supports 1–6 reference images per sample via dynamic shape_list.
  • Batched RoPE: Per-component rotary position embeddings with temporal offsets for each reference image and the target.
  • Norm-guided CFG: Classifier-free guidance with norm rescaling in a single forward pass.
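
To make the bookkeeping in the first three bullets concrete, here is a minimal pure-Python sketch of how per-sample patch counts, a padded target_mask, and per-component temporal offsets could be derived from a shape_list. It is not the pipeline's code. The patch size (1, 2, 2), the references-first ordering, the offset rule (each component starts where the previous one ends on the temporal axis), and the names SampleLayout, count_patches, layout_sample, and layout_batch are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical patch size (pt, ph, pw); the issue does not state the real values.
PATCH_SIZE = (1, 2, 2)
# From the issue: 1 to 6 reference images per sample.
MAX_REFERENCES = 6


@dataclass
class SampleLayout:
    patch_counts: list      # patches per component (references first, target last: an assumption)
    temporal_offsets: list  # assumed temporal start index per component for RoPE
    target_mask: list       # True for target patches; False for reference patches and padding


def count_patches(shape, patch_size=PATCH_SIZE):
    """Return how many (pt, ph, pw) patches fit in a latent of shape (T, H, W)."""
    for dim, p in zip(shape, patch_size):
        if dim % p != 0:
            raise ValueError(f"latent shape {shape} is not divisible by patch size {patch_size}")
    t, h, w = (dim // p for dim, p in zip(shape, patch_size))
    return t * h * w


def layout_sample(ref_shapes, target_shape, max_patches):
    """Describe one sample: its references followed by the target, padded to max_patches."""
    if not 1 <= len(ref_shapes) <= MAX_REFERENCES:
        raise ValueError(f"expected 1 to {MAX_REFERENCES} reference images, got {len(ref_shapes)}")
    shapes = list(ref_shapes) + [target_shape]
    counts = [count_patches(s) for s in shapes]
    total = sum(counts)
    if total > max_patches:
        raise ValueError("max_patches is smaller than the number of patches in this sample")

    # Assumed offset rule: each component begins where the previous one ended in time.
    offsets, cursor = [], 0
    for t, _, _ in shapes:
        offsets.append(cursor)
        cursor += t // PATCH_SIZE[0]

    # Padding is marked False here; a real implementation would likely need a
    # separate padding mask to tell padding apart from reference patches.
    n_ref = total - counts[-1]
    mask = [False] * n_ref + [True] * counts[-1] + [False] * (max_patches - total)
    return SampleLayout(counts, offsets, mask)


def layout_batch(shape_list):
    """shape_list holds one (ref_shapes, target_shape) pair per sample in the batch."""
    totals = [sum(count_patches(s) for s in refs) + count_patches(tgt) for refs, tgt in shape_list]
    max_patches = max(totals)
    return max_patches, [layout_sample(refs, tgt, max_patches) for refs, tgt in shape_list]
```

Calling layout_batch with a batch whose samples carry different numbers of references gives every sample a target_mask of the same length, which is what allows a ragged batch to be stored in one [B, max_patches, ...] tensor.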

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation
