Hi @Seth-Park,
I'm struggling to understand the evaluation metrics. In the paper you've got Table 2:

But after downloading and evaluating your pretrained model I got the following numbers:
------------semantic change best result-------------
CIDEr: 1.00742455128 (test)
Bleu_4: 0.511085051903 (test)
Bleu_3: 0.612453337061 (test)
Bleu_2: 0.712512983841 (test)
Bleu_1: 0.80904675167 (test)
ROUGE_L: 0.654282229769 (test)
METEOR: 0.334430665011 (test)
SPICE: 0.2793739702 (test)
------------non-semantic change best result-------------
CIDEr: 1.14646062504 (test)
Bleu_4: 0.618167729466 (test)
Bleu_3: 0.64995045894 (test)
Bleu_2: 0.715953303178 (test)
Bleu_1: 0.783191698339 (test)
ROUGE_L: 0.763303090909 (test)
METEOR: 0.50608216891 (test)
SPICE: 0.346267623357 (test)
------------total best result-------------
CIDEr: 1.14955152668 (test)
Bleu_4: 0.535546570013 (test)
Bleu_3: 0.621429742545 (test)
Bleu_2: 0.71323722181 (test)
Bleu_1: 0.801726535202 (test)
ROUGE_L: 0.708792660339 (test)
METEOR: 0.37936030774 (test)
SPICE: 0.312820796779 (test)
So I believe I should multiply these metrics by 100 to compare them with the paper, right? If so, they come out better than the reported ones. For example, in the TOTAL section:
Bleu_4: pretrained 53.6 > reported 47.3
CIDEr: pretrained 115.0 > reported 112.3
METEOR: pretrained 37.9 > reported 33.9
SPICE: pretrained 31.3 > reported 24.5
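To make the comparison reproducible, here is a minimal sketch of how I scaled the numbers. It assumes the evaluation output is a plain fraction and that the paper reports the same quantity multiplied by 100. The `compare` helper and the two dictionaries are just names I made up for illustration; the raw values are copied from the "total best result" block above, and the reported values are the ones I read from the paper.

```python
# Raw scores copied from the "total best result" block above (fractions).
pretrained = {
    "Bleu_4": 0.535546570013,
    "CIDEr": 1.14955152668,
    "METEOR": 0.37936030774,
    "SPICE": 0.312820796779,
}

# Values as I read them from the TOTAL section of the paper.
reported = {"Bleu_4": 47.3, "CIDEr": 112.3, "METEOR": 33.9, "SPICE": 24.5}


def compare(raw_scores, paper_scores):
    """Scale raw scores by 100 (assumption) and return (scaled, paper, gap) per metric."""
    rows = {}
    for name, raw in raw_scores.items():
        scaled = round(raw * 100, 1)  # one decimal place, as in the paper
        gap = round(scaled - paper_scores[name], 1)
        rows[name] = (scaled, paper_scores[name], gap)
    return rows


rows = compare(pretrained, reported)
for name, (mine, paper, gap) in rows.items():
    print(f"{name}: pretrained {mine} vs reported {paper} (gap {gap:+})")
```

With that scaling, every metric from the pretrained model is above the reported one, which is what prompted my question.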
Is there any particular reason why you reported smaller numbers in the paper?