EZSR: Event-based Zero-Shot Recognition

This repository contains the PyTorch code for our paper "EZSR: Event-based Zero-Shot Recognition".

arxiv | project page

Introduction

This paper studies zero-shot object recognition using event camera data. Guided by CLIP, which is pre-trained on RGB images, existing approaches achieve zero-shot object recognition by maximizing embedding similarities between event data encoded by an event encoder and RGB images encoded by the CLIP image encoder. Alternatively, several methods learn RGB frame reconstructions from event data for the CLIP image encoder. However, these approaches often result in suboptimal zero-shot performance.

This study develops an event encoder without relying on additional reconstruction networks. We theoretically analyze the performance bottlenecks of previous approaches: global similarity-based objectives (i.e., maximizing the embedding similarities) cause semantic misalignments between the learned event embedding space and the CLIP text embedding space, due to the degrees of freedom such objectives leave open. To mitigate this issue, we explore a scalar-wise regularization strategy. Furthermore, to scale up the number of event and RGB data pairs for training, we also propose a pipeline for synthesizing event data from static RGB images.
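As a toy illustration of the degree-of-freedom issue (not the paper's analysis; all vectors are invented 3-D examples), the sketch below builds two unit-length event embeddings that have exactly the same cosine similarity to an image embedding, yet rank two text embeddings in opposite order. A loss that only looks at the global similarity to the image embedding cannot tell them apart.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Toy image embedding (unit vector); values are made up for illustration.
image = [1.0, 0.0, 0.0]

# Two toy text embeddings that are equally similar to the image embedding.
text_a = normalize([1.0, 1.0, 0.0])
text_b = normalize([1.0, -1.0, 0.0])

# Two unit-length event embeddings with the same cosine similarity (0.8) to
# the image; they differ only in the component orthogonal to it.
event_1 = [0.8, 0.6, 0.0]
event_2 = [0.8, -0.6, 0.0]

sim_1 = dot(event_1, image)
sim_2 = dot(event_2, image)

def predict(event):
    """Return the name of the text embedding closest to the event embedding."""
    scores = {"text_a": dot(event, text_a), "text_b": dot(event, text_b)}
    return max(scores, key=scores.get)

pred_1 = predict(event_1)
pred_2 = predict(event_2)
print(sim_1, sim_2, pred_1, pred_2)
```

Both event embeddings score identically against the image embedding, but they lead to different zero-shot predictions, which is the kind of misalignment with the text embedding space described above.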

Experimentally, our data synthesis strategy exhibits an attractive scaling property, and our method achieves superior zero-shot object recognition performance across a wide range of standard benchmark datasets, even compared with prior supervised learning approaches. For example, we achieve 47.84% zero-shot accuracy on the N-ImageNet dataset.

Framework

Overview

Heatmap w.r.t. Text

Requirements

torch 2.3.0+cu121
transformers 4.44.0
timm 0.9.16

Usage

import torch
from eva_clip import create_model_and_transforms, get_tokenizer
from dataset.dataset import load_and_preprocess

model_name = "EVA02-CLIP-bigE-14-plus"
pretrained = "EZSR-CLIP-bigE-14-plus.pt"  # path to the downloaded checkpoint

event_path = "asset/test_event.npz"
SENSOR_H = 480
SENSOR_W = 640
event_length = 15000
representation = "histogram"
event_viz_path = "event.png"

caption = ["a dragon", "a dog", "a cat"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = create_model_and_transforms(model_name, None, force_custom_clip=True)
checkpoint = torch.load(pretrained, map_location="cpu")
# strict=False tolerates checkpoint keys that do not match the model exactly.
model.load_state_dict(checkpoint, strict=False)

tokenizer = get_tokenizer(model_name)
model = model.to(device)

# Load and preprocess the events, then add a batch dimension.
event = load_and_preprocess(event_path, SENSOR_H, SENSOR_W, event_length, representation, event_viz_path, preprocess).unsqueeze(0).to(device)
# Tokenize the candidate captions defined above.
text = tokenizer(caption).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    event_features = model.encode_image(event)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    event_features /= event_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * event_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # example output: [[3.2870e-06, 5.1930e-04, 9.9948e-01]]
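
To read the result, the highest probability marks the predicted caption. A minimal sketch, with the example output above copied into a plain Python list (in the script itself you would likely start from `text_probs[0].tolist()`):

```python
# Candidate captions from the usage script (`caption`).
caption = ["a dragon", "a dog", "a cat"]
# Example probabilities copied from the printed output above.
probs = [3.2870e-06, 5.1930e-04, 9.9948e-01]

# Index of the largest probability, then the matching caption.
best_index = max(range(len(probs)), key=probs.__getitem__)
predicted_label = caption[best_index]
print(f"{predicted_label} ({probs[best_index]:.2%})")
```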

The event and paired RGB images for asset/test_event.npz.
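
The usage script passes `representation = "histogram"` to `load_and_preprocess`, whose internals are not shown in this README. As a rough sketch of what an event histogram commonly means (an assumption, not necessarily what this repository does), each event increments a per-pixel counter in one of two polarity channels. The function name `events_to_histogram` and the synthetic events below are hypothetical.

```python
import numpy as np

def events_to_histogram(x, y, p, height, width):
    """Count events per pixel into a (2, H, W) array.

    Channel 0 holds negative-polarity events, channel 1 positive ones.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    # Treat any polarity value > 0 as positive (covers 0/1 and -1/+1 encodings).
    channel = (np.asarray(p) > 0).astype(np.int64)
    # np.add.at accumulates correctly when the same pixel appears repeatedly.
    np.add.at(hist, (channel, np.asarray(y), np.asarray(x)), 1.0)
    return hist

# Synthetic events on a tiny 4x6 sensor (made-up values for illustration).
x = np.array([0, 0, 5, 2])
y = np.array([1, 1, 3, 2])
p = np.array([1, 1, 0, 1])  # 1 = positive, 0 = negative

hist = events_to_histogram(x, y, p, height=4, width=6)
print(hist.shape, hist.sum())
```

A dense array like this can be rendered as an image, which is presumably why an image-style `preprocess` transform can be applied to it before the encoder.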

Evaluation of Zero-shot Event Classification Performance

We provide an example of evaluating CLIP-bigE-14-plus on the N-ImageNet dataset. The dataset can be obtained here.

MODEL=EVA02-CLIP-bigE-14-plus
PRETRAINED=EZSR-CLIP-bigE-14-plus.pt
# Set --root and --input_filename to the corresponding paths on your machine.
CUDA_VISIBLE_DEVICES=3 python engine_evaluate.py \
        --batch_size 1 \
        --num_workers 7 \
        --force_custom_clip \
        --model $MODEL \
        --pretrained $PRETRAINED \
        --dataset imagenet \
        --root ...N_ImageNet \
        --input_filename ...val_file.text
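
The metric reported for this benchmark is zero-shot top-1 accuracy. How `engine_evaluate.py` computes it is not shown in this README; the sketch below only illustrates the metric itself, using a hypothetical helper `top1_accuracy` and made-up scores. A sample counts as correct when the class with the highest score matches its ground-truth label.

```python
def top1_accuracy(scores, targets):
    """Fraction of samples whose highest-scoring class equals the target.

    scores: list of per-class score lists, one list per sample.
    targets: list of ground-truth class indices, one per sample.
    """
    if not targets or len(scores) != len(targets):
        raise ValueError("scores and targets must be non-empty and equally long")
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
    return correct / len(targets)

# Made-up similarity scores for four samples and three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
targets = [1, 0, 2, 2]

accuracy = top1_accuracy(scores, targets)
print(f"top-1 accuracy: {accuracy:.2%}")
```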

Before running the evaluation, download the pre-trained model you want to evaluate from the table below.

Model Name	Init	Weight
`ViT-B/32`	OpenAI	link
`ViT-B/16`	OpenAI	link
`ViT-B/16`	EVA	link
`ViT-L/14`	OpenAI	link
`ViT-L/14`	EVA	link
`ViT-L/14-336`	OpenAI	link
`ViT-L/14-336`	EVA	link
`ViT-bigE/14`	EVA	link

Note: The provided ViT-bigE/14 checkpoint includes both the text encoder and the visual encoder. The other checkpoints include only the visual encoder to save space. To use them, load the corresponding text encoder from either EVA (for EVA-initialized models) or OpenCLIP (for OpenAI-initialized models). We use text encoders from both the EVA-02 series and the OpenAI series.
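
One possible way to combine a visual-only checkpoint with a separately downloaded text encoder is to merge the two state dicts before loading them. The sketch below uses plain dictionaries in place of real tensors; the `visual.` and `text.` key prefixes and the helper name `merge_encoder_weights` are assumptions, so inspect the keys of your own checkpoints first.

```python
def merge_encoder_weights(visual_ckpt, text_ckpt, text_prefix="text."):
    """Return a new state dict with every entry of visual_ckpt plus the
    entries of text_ckpt whose keys start with text_prefix."""
    merged = dict(visual_ckpt)
    for key, value in text_ckpt.items():
        # Only copy text-encoder entries; never overwrite the visual weights.
        if key.startswith(text_prefix) and key not in merged:
            merged[key] = value
    return merged

# Stand-ins for loaded checkpoints; strings replace real tensors.
visual_ckpt = {"visual.proj": "event-encoder weight"}
text_ckpt = {
    "visual.proj": "original image-encoder weight",
    "text.token_embedding.weight": "text-encoder weight",
    "logit_scale": "temperature",
}

merged = merge_encoder_weights(visual_ckpt, text_ckpt)
print(sorted(merged))
```

The merged dictionary could then be passed to `model.load_state_dict(merged, strict=False)`, mirroring the usage script.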

How to get the dataset

# code is coming soon

Contact

If you have any questions relating to our work, do not hesitate to contact me.

Acknowledgement

EZSR is built using the awesome OpenCLIP, EVA, BEiT, DeiT, N-ImageNet, and mae.

Citation

@misc{yang2024ezsreventbasedzeroshotrecognition,
      title={EZSR: Event-based Zero-Shot Recognition}, 
      author={Yan Yang and Liyuan Pan and Dongxu Li and Liu Liu},
      year={2024},
      eprint={2407.21616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21616}, 
}
