How to understand the results #1

@dkoguciuk

Hi @Seth-Park ,

I'm struggling to understand the evaluation metrics. In the paper you've got Table 2:

[Image: Screenshot from 2020-10-21 16-21-06]

But after downloading and evaluating your pretrained model I got the following numbers:

------------semantic change best result-------------
CIDEr: 1.00742455128 (test)
Bleu_4: 0.511085051903 (test)
Bleu_3: 0.612453337061 (test)
Bleu_2: 0.712512983841 (test)
Bleu_1: 0.80904675167 (test)
ROUGE_L: 0.654282229769 (test)
METEOR: 0.334430665011 (test)
SPICE: 0.2793739702 (test)
------------non-semantic change best result-------------
CIDEr: 1.14646062504 (test)
Bleu_4: 0.618167729466 (test)
Bleu_3: 0.64995045894 (test)
Bleu_2: 0.715953303178 (test)
Bleu_1: 0.783191698339 (test)
ROUGE_L: 0.763303090909 (test)
METEOR: 0.50608216891 (test)
SPICE: 0.346267623357 (test)
------------total best result-------------
CIDEr: 1.14955152668 (test)
Bleu_4: 0.535546570013 (test)
Bleu_3: 0.621429742545 (test)
Bleu_2: 0.71323722181 (test)
Bleu_1: 0.801726535202 (test)
ROUGE_L: 0.708792660339 (test)
METEOR: 0.37936030774 (test)
SPICE: 0.312820796779 (test)

So I assume these metrics should be multiplied by 100 to match the scale used in the paper, right? Even then, they come out higher than the numbers in the paper, e.g. in the total section:

Bleu_4 pretrained: 53.6 > Bleu_4 reported: 47.3
CIDEr pretrained: 115.0 > CIDEr reported: 112.3
METEOR pretrained: 37.9 > METEOR reported: 33.9
SPICE pretrained: 31.3 > SPICE reported: 24.5
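
For reference, the comparison above can be reproduced with a small sketch like the one below. It assumes the evaluation output is a fraction that needs scaling by 100, which is exactly the assumption being asked about. The names `pretrained`, `reported` and `compare` are made up for illustration and are not part of the repository.

```python
# Scores copied from the "total best result" block of the evaluation output.
pretrained = {
    "Bleu_4": 0.535546570013,
    "CIDEr": 1.14955152668,
    "METEOR": 0.37936030774,
    "SPICE": 0.312820796779,
}

# Values from the paper's Table 2 as quoted in this issue (already on a 0-100 scale).
reported = {"Bleu_4": 47.3, "CIDEr": 112.3, "METEOR": 33.9, "SPICE": 24.5}


def compare(pretrained, reported):
    """Scale each pretrained score by 100 and pair it with the reported value.

    Returns {metric: (scaled_pretrained, reported, difference)}, rounded to 1 decimal.
    """
    rows = {}
    for name, paper_value in reported.items():
        scaled = round(pretrained[name] * 100, 1)  # assumption: raw scores are fractions
        rows[name] = (scaled, paper_value, round(scaled - paper_value, 1))
    return rows


for name, (scaled, paper_value, diff) in compare(pretrained, reported).items():
    print(f"{name}: pretrained {scaled} vs reported {paper_value} (diff {diff:+})")
```

Under that scaling assumption, every difference is positive, which is the gap I'm asking about.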

Is there any particular reason why you reported smaller numbers in the paper?