Looking at the model parameters in vit.py, the released checkpoint was trained with the dinov2l16_384 preset, i.e. ViT-Large with about 300M parameters. DINOv2 also has lighter ViT-Base (86M) and ViT-Small (~21M) backbones. If you could release checkpoints for those as well, inference should be considerably faster.
vit.py:

```python
ViTPreset = Literal[
    "dinov2l16_384",
    "dinov2b16_384",
    "dinov2s16_384",
]

VIT_CONFIG_DICT: dict[ViTPreset, ViTConfig] = {
    # ViT-Large (300M) - dinov2l16_384
    "dinov2l16_384": ViTConfig(
        in_chans=3,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        init_values=1e-5,
        global_pool="",
    ),
    # ViT-Base (86M) - dinov2b16_384
    "dinov2b16_384": ViTConfig(
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        init_values=1e-5,
        global_pool="",
    ),
    # ViT-Small - dinov2s16_384
    "dinov2s16_384": ViTConfig(
        in_chans=3,
        embed_dim=384,
        depth=12,
        num_heads=6,
        init_values=1e-5,
        global_pool="",
    ),
}
```
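As a rough sanity check on those sizes, the parameter counts can be estimated from `embed_dim` and `depth` alone. This is my own back-of-the-envelope sketch (the helper name and the 12·d² per-block approximation are not from the repo; it ignores the patch embedding, norms, and positional embeddings):

```python
def approx_vit_params(embed_dim: int, depth: int) -> int:
    """Rough ViT parameter count: per block, attention uses ~4*d^2
    weights (q/k/v/out projections) and the MLP ~8*d^2 (two linear
    layers with a 4x hidden expansion), giving ~12*d^2 per block."""
    return depth * 12 * embed_dim**2

for name, (d, n) in {
    "dinov2l16_384": (1024, 24),
    "dinov2b16_384": (768, 12),
    "dinov2s16_384": (384, 12),
}.items():
    print(f"{name}: ~{approx_vit_params(d, n) / 1e6:.0f}M")
```

This prints roughly 302M, 85M, and 21M, consistent with the 300M and 86M figures above.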
monodepth.py:

```python
# Map each encoder preset to the number of output channels
# for each tensor of the decoder output.
MONODEPTH_ENCODER_DIMS_MAP: dict[ViTPreset, list[int]] = {
    # Released model
    "dinov2l16_384": [256, 512, 1024, 1024],
    # Proposed additions
    "dinov2b16_384": [192, 384, 768, 768],  # ViT-Base
    "dinov2s16_384": [96, 192, 384, 384],   # ViT-Small
}

MONODEPTH_HOOK_IDS_MAP: dict[ViTPreset, list[int]] = {
    # Released model
    "dinov2l16_384": [5, 11, 17, 23],
    # Proposed additions
    "dinov2b16_384": [2, 5, 8, 11],  # ViT-Base
    "dinov2s16_384": [2, 5, 8, 11],  # ViT-Small
}
```
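For what it's worth, both proposed tables follow the pattern already used for the ViT-Large entries: the decoder dims are [d/4, d/2, d, d] for embedding dim d, and the hooks tap the last block of each quarter of the encoder. A small sketch (the helper names are mine, not from the repo):

```python
def proposed_hook_ids(depth: int, num_hooks: int = 4) -> list[int]:
    # Last (zero-indexed) block of each quarter of the encoder.
    step = depth // num_hooks
    return [step * (i + 1) - 1 for i in range(num_hooks)]

def proposed_encoder_dims(embed_dim: int) -> list[int]:
    # Same [d/4, d/2, d, d] progression as the ViT-Large entry.
    return [embed_dim // 4, embed_dim // 2, embed_dim, embed_dim]

print(proposed_hook_ids(24))       # [5, 11, 17, 23]  (ViT-Large)
print(proposed_hook_ids(12))       # [2, 5, 8, 11]    (ViT-Base/Small)
print(proposed_encoder_dims(768))  # [192, 384, 768, 768]
```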
Hopefully you can train and release the ViT-Base and ViT-Small variants.