Yusen-Peng/DRIP

DRIP

Dynamic Patch Pooling for Efficient Visual Instruction Tuning
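As a rough illustration of the core idea (a sketch, not the repository's implementation — the boundary mask, grouping rule, and mean pooling below are assumptions), dynamic patch pooling merges contiguous runs of patch tokens between predicted boundaries into single tokens:

```python
import numpy as np

def dynamic_patch_pool(tokens: np.ndarray, boundaries: np.ndarray) -> np.ndarray:
    """Mean-pool contiguous runs of patch tokens into single tokens.

    tokens:     (N, D) patch embeddings.
    boundaries: (N,) 0/1 mask; a 1 marks the first patch of a new group.
    """
    assert boundaries[0] == 1, "the first patch always opens a group"
    group_ids = np.cumsum(boundaries) - 1      # group index for each patch
    n_groups = group_ids[-1] + 1
    pooled = np.zeros((n_groups, tokens.shape[1]))
    for g in range(n_groups):
        pooled[g] = tokens[group_ids == g].mean(axis=0)
    return pooled

# 6 patches pooled into 3 groups -> 0.5 compression rate
tokens = np.arange(12, dtype=float).reshape(6, 2)
boundaries = np.array([1, 0, 1, 0, 0, 1])
out = dynamic_patch_pool(tokens, boundaries)
print(out.shape)  # (3, 2)
```

Because the boundaries are predicted per image, the number of output tokens can adapt to image content, unlike fixed pooling.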

Environment setup

Create a new conda environment from scratch:

module load miniconda3/24.1.2-py310 # for OSC
module load conda # for Anvil
conda create -n DRIP python=3.11 -y
conda activate DRIP
python -m pip install -r requirements.txt

or activate an existing environment:

module load miniconda3/24.1.2-py310 # for OSC
module load conda # for Anvil
conda deactivate
conda activate DRIP

ImageNet (OSC pitzer)

running experiments:

sbatch scripts/task1/finetune_imagenet.sh

boundary visualization & attention map analysis:

for ImageNet

python src/boundary_visual_IN.py

GFLOPs measurement:

# DRIP
python src/FLOP.py --mode DRIP --compression_rate 0.25
# Fixed pooling
python src/FLOP.py --mode fixed_pooling --compression_rate 0.25
# original ViT
python src/FLOP.py --mode ViT
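For intuition on why compressing tokens saves compute: per-layer transformer FLOPs scale roughly as 12·N·D² (projections + MLP) plus 2·N²·D (attention), so the token count N dominates the savings. A back-of-envelope sketch — the constants below (576 patches, width 1024, 24 layers, e.g. a ViT-L/14 at 336px) are assumptions, it ignores where in the network pooling happens, and it is not what `src/FLOP.py` computes:

```python
def vit_layer_gflops(n_tokens: int, d: int = 1024) -> float:
    """Rough per-layer cost: ~12*N*D^2 for QKV/output projections and the MLP,
    plus ~2*N^2*D for attention scores and the weighted sum."""
    return (12 * n_tokens * d**2 + 2 * n_tokens**2 * d) / 1e9

layers = 24
full = layers * vit_layer_gflops(576)    # e.g. ViT-L/14 @ 336px -> 576 patches
pooled = layers * vit_layer_gflops(144)  # 0.25 compression rate -> 144 tokens
print(f"full: {full:.0f} GFLOPs, pooled: {pooled:.0f} GFLOPs")
```

Use `src/FLOP.py` above for the actual measurement.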

examples:

| run (checkpoint) | boundaries | attention maps |
| --- | --- | --- |
| imagenet_DRIP_4x_01_warmup2 (model_299.pth) | (image) | (image) |
| imagenet_DRIP_4x_half_LR_no_warmup (model_186.pth) | (image) | (image) |

LLaVA

Instruction

Edit src/LLaVA_wrapper/llava_local/model/multimodal_encoder/builder.py to configure the merging strategy (ViT / original, Fixed / fixed pooling, DRIP / dynamic tokenization) and the corresponding compression rate (0.5 / 2x, 0.25 / 4x, 0.1 / 10x):

MERGE_STRATEGY = "Fixed"  # "ViT", "Fixed", or "DRIP"
COMPRESSION_RATE = 0.25

Additional note: the ViT backbone from LLaVA checkpoint is openai/clip-vit-large-patch14-336.
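A minimal sketch of how these two flags relate token counts — the helper below is hypothetical, not the repository's actual dispatch logic in builder.py:

```python
MERGE_STRATEGY = "DRIP"   # "ViT" (no merging), "Fixed", or "DRIP"
COMPRESSION_RATE = 0.25   # 0.5 -> 2x, 0.25 -> 4x, 0.1 -> 10x fewer tokens

def target_token_count(n_patches: int) -> int:
    """Hypothetical helper: visual tokens the encoder should emit per image."""
    if MERGE_STRATEGY == "ViT":
        return n_patches  # original ViT: keep every patch token
    return max(1, round(n_patches * COMPRESSION_RATE))

# clip-vit-large-patch14-336 yields (336 / 14)^2 = 576 patch tokens
print(target_token_count(576))  # 144
```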

Then we are good to move on to the benchmark experiments.

Evaluation/Benchmarks

General VQA (4):

# SQA 
sbatch scripts/task3/eval/eval_SQA.sh
# MM-Bench
sbatch scripts/task3/eval/eval_MMBench.sh
# MME
sbatch scripts/task3/eval/eval_MME.sh
# VQAv2 [🚨LONG🚨]
# need to submit the result json file to:
# https://eval.ai/web/challenges/challenge-page/830
sbatch scripts/task3/eval/eval_VQAv2.sh

Reasoning (1):

# GQA
sbatch scripts/task3/eval/eval_GQA.sh

OCR (1):

# TextVQA
sbatch scripts/task3/eval/eval_textVQA.sh

Hallucination (1):

# POPE
sbatch scripts/task3/eval/eval_POPE.sh

Free Response (1):

# LLaVA-in-the-wild
sbatch scripts/task3/eval/eval_in_the_wild.sh

LLaVA Finetuning

flash attention

Before anything, make sure flash attention is installed:

# install
sbatch flash_attn.sh
# test
sbatch test_flash_attn.sh
# what to expect: 
# torch.Size([1, 128, 8, 64]) torch.float16 cuda:0

pretraining (token alignment)

# ascend
sbatch scripts/task3/pretrain_ascend.sh
# anvil

finetuning/VQA SFT

sbatch scripts/task3/finetune.sh

Results

(results figure)

Contacts

If you have any questions or suggestions, feel free to contact the authors, or describe them in Issues.
