Hi, thank you for the excellent work on MokA! I've been trying to reproduce the results reported in your paper, but I'm encountering a significant gap between my results and the reported numbers. I would greatly appreciate your help in identifying what might be causing this discrepancy.
Environment Setup
- Python: 3.9
- PyTorch: 2.1.0
- Transformers: 4.37.2
- DeepSpeed: 0.12.6
- Hardware: 6× A100 80GB GPUs
Pre-trained Weights Used
I used the pre-trained projector weights provided in your repository:
- ✅ Audio projector: `pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin`
- ✅ Visual projector: `pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin`
- ✅ LLaMA-2-7B-Chat-HF
- ✅ CLIP-ViT-L/14
- ✅ BEATs (Fine-tuned BEATs_iter3+ AS2M)
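For completeness, this is how I sanity-check the projector checkpoints before training. A minimal sketch of my own verification step, not something from the MokA repo; the key layout printed is just whatever I observe locally:

```python
import torch

# Inspect the two projector checkpoints (paths are the ones listed above).
for path in [
    "pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin",
    "pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin",
]:
    state = torch.load(path, map_location="cpu")
    print(path)
    for key, tensor in state.items():
        print(f"  {key}: {tuple(tensor.shape)}")
```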
Training Configuration
I followed the configuration in `scripts/finetune/ft.sh`:
| Hyperparameter | My Setting | From ft.sh |
|---|---|---|
| LLM Backbone | LLaMA-2-7B-Chat | LLaMA-2-7B-Chat |
| LoRA Rank | 444 (4×3 modalities) | 444 |
| LoRA Alpha | 16 | 16 |
| LoRA Dropout | 0.05 | 0.05 |
| Learning Rate | 1e-4 | 1e-4 |
| Weight Decay | 0.0 | 0.0 |
| Warmup Ratio | 0.03 | 0.03 |
| LR Scheduler | Cosine | Cosine |
| Epochs | 3 | 3 |
| Per-device Batch Size | 4 | 4 |
| Gradient Accumulation | 1 | 1 |
| GPUs | 6 | 16 (mentioned in README) |
| Global Batch Size | 24 | 64 (16 GPUs × 4) |
| BF16 | False | True |
| Visual Query Tokens | 32 | 32 |
| Audio Query Tokens | 32 | 32 |
| Video Frames | 10 | 10 |
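For clarity, the "Global Batch Size" row is derived as GPUs × per-device batch size × gradient accumulation steps; a quick sanity check of the two configurations:

```python
# Effective global batch size for each setup.
def global_batch(gpus: int, per_device: int, grad_accum: int) -> int:
    return gpus * per_device * grad_accum

print(global_batch(gpus=16, per_device=4, grad_accum=1))  # 64 (README setup)
print(global_batch(gpus=6, per_device=4, grad_accum=1))   # 24 (my setup)
```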
My Reproduction Results
MUSIC-AVQA Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Overall Accuracy | 70.23% | 75.71% | -5.48% |
Evaluation on 9185 test samples.
AVE Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Event Classification | 94.78% | - | - |
| Temporal Localization (±1s) | 68.16% | - | - |
| Joint Accuracy | 64.68% | 74.68% | -10.00% |
Evaluation on 402/402 test samples.
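For transparency, this is how I compute the three AVE numbers above. It is a minimal sketch of my own evaluation script; the field names and the ±1s onset tolerance are my assumptions, since I could not find the metric spelled out in the paper:

```python
from dataclasses import dataclass

@dataclass
class AVEPrediction:
    # Field names are from my eval script, not the MokA repo.
    pred_event: str
    gold_event: str
    pred_onset: float  # predicted event start time (seconds)
    gold_onset: float  # ground-truth event start time (seconds)

def ave_metrics(preds: list[AVEPrediction], tol: float = 1.0):
    """Event accuracy, temporal accuracy (within ±tol s), and joint accuracy."""
    n = len(preds)
    event_ok = [p.pred_event == p.gold_event for p in preds]
    time_ok = [abs(p.pred_onset - p.gold_onset) <= tol for p in preds]
    joint_ok = [e and t for e, t in zip(event_ok, time_ok)]
    return sum(event_ok) / n, sum(time_ok) / n, sum(joint_ok) / n
```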
Questions
- Global Batch Size: The README mentions using 16 A100 GPUs for fine-tuning, which would give a global batch size of 64 (16 × 4). I only have 6 GPUs available, resulting in a global batch size of 24. Could this difference significantly impact the final performance? (See the sketch after this list for the compensation I'm considering.)
- AVE Evaluation Metric: The paper reports a single number (74.68%) for AVE. Is this the joint accuracy (both event classification and temporal localization correct)? Or is it a different metric?
- Fine-tuned Checkpoints: Would it be possible to release the fine-tuned model checkpoints for MUSIC-AVQA and AVE? This would help verify whether the gap is due to training configuration differences or evaluation methodology.
- Data Splits: Are you using the official train/test splits for MUSIC-AVQA and AVE? I'm using:
  - MUSIC-AVQA: 9185 test samples
  - AVE: 402 test samples
- Any Other Critical Settings: Are there any other hyperparameters or settings not mentioned in the scripts that might be crucial for reproduction?
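Regarding the first question, the workaround I'm considering is raising gradient accumulation so the effective batch size approaches 64; with 6 GPUs the nearest reachable value is 72 (6 × 4 × 3). A hedged sketch using the standard HuggingFace training arguments that ft.sh appears to wrap (this is my guess, not the repo's actual configuration):

```python
from transformers import TrainingArguments

# Hypothetical adjustment (not from ft.sh): approximate the README's
# global batch size of 64 on 6 GPUs via gradient accumulation.
args = TrainingArguments(
    output_dir="./checkpoints",        # placeholder
    per_device_train_batch_size=4,     # unchanged from ft.sh
    gradient_accumulation_steps=3,     # was 1; 6 GPUs * 4 * 3 = 72 ≈ 64
    learning_rate=1e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    bf16=True,                         # ft.sh sets True; my run had False
)
```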
Thank you for your time and assistance!