Context:
I am reviewing the implementation of the forward pass in the Flow-based Custom Transformer for the Singing Voice Synthesis (SVS) task. While comparing the current codebase with the paper's description, I noticed a potential discrepancy in how the content embedding is injected into the latent sequence.
Discrepancy Details:

- Paper Description: The paper states:
  > "...We then concatenate $x_t$ with the content embedding $z_c$ from the BBC Encoder... This allows the model to use self-attention to learn content and style transfer."
- Current Code Implementation: In the current `forward` function (https://github.com/AaronZ345/TCSinger2/blob/main/ldm/modules/diffusionmodules/tcsinger2.py#L403-L404), the content embedding (derived from MIDI and phoneme embeddings) is added to the combined latent variable using a gated mechanism:
```python
# Combine MIDI and phoneme embeddings
content = midi + ph
content = self.final_proj(content.transpose(1, 2)).transpose(1, 2)
# ... (x_combined is defined as the concat of prompt and x)
# Current injection method: gated addition (lines 85-87)
gate = torch.sigmoid(self.gate_content(content))
x_combined += content * gate
```
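To make the difference concrete, here is a minimal, self-contained sketch of the two injection strategies. All tensor shapes, the `gate_proj` layer, and the choice of concatenating along the sequence axis (so that self-attention can attend between latent and content tokens, as the quoted sentence suggests) are my illustrative assumptions, not code from the repo or the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model = 2, 16, 8  # illustrative sizes, not from the repo

x_combined = torch.randn(batch, seq_len, d_model)  # stand-in for concat(prompt, x_t)
content = torch.randn(batch, seq_len, d_model)     # stand-in for z_c

# (a) Gated addition, as in the current code: the sequence length is
# unchanged and content is blended element-wise into the latent.
gate_proj = nn.Linear(d_model, d_model)  # hypothetical stand-in for self.gate_content
gate = torch.sigmoid(gate_proj(content))
x_gated = x_combined + content * gate
print(x_gated.shape)   # torch.Size([2, 16, 8])

# (b) Concatenation, as the paper describes (axis is my assumption):
# joining along the sequence dimension doubles the token count, and
# self-attention can then relate latent tokens to content tokens.
x_concat = torch.cat([x_combined, content], dim=1)
print(x_concat.shape)  # torch.Size([2, 32, 8])
```

The two variants are not equivalent: gated addition keeps the sequence length fixed and fuses content locally, while sequence-axis concatenation leaves fusion to the attention layers, which is what the quoted paper passage appears to motivate.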