add tests for meval to replicate paper results by pfliu-nlp · Pull Request #605 · neulab/ExplainaBoard

pfliu-nlp · 2022-12-27T21:34:00Z

Overview

This PR adds tests to verify whether our implemented meta-evaluation processor is able to replicate reported results from existing published papers.

Relevant issue: https://github.com/inspired-co/taskboard/issues/180

Details

Collect the system outputs for two metrics (rouge1 and bartscore) from the repo listed under References.
Use ExplainaBoard to process these outputs and compare the results with the ones reported in that repo.
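
As a rough illustration of what such a meta-evaluation check involves, the sketch below correlates per-sample metric scores with human ratings using a Spearman rank correlation written with the standard library only. The function names and the toy numbers are assumptions for illustration; they are not ExplainaBoard's API, and the actual processor may use a different correlation or aggregation level.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (made up): one metric score and one human rating per sample.
metric_scores = [0.31, 0.42, 0.18, 0.55, 0.47]
human_ratings = [2.0, 4.0, 1.0, 3.0, 5.0]
correlation = spearman(metric_scores, human_ratings)
```

A replication test would then compare a value like `correlation` against the number reported by the original authors.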

References

Paper: BARTSCORE: Evaluating Generated Text as Text Generation
Code: https://github.com/neulab/BARTScore

odashi · 2022-12-28T00:27:39Z

integration_tests/meta_eval_nlg_test.py

+        self.assertGreater(len(sys_info.results.analyses), 0)
+        self.assertAlmostEqual(
+            overall_score,
+            0.0946,


I couldn't find this score in the ROUGE results in the original paper.

ROUGE1: Table 4 -> Column: NeR18 -> COH
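
Because the reference value is copied from a table that rounds to a few decimals, the tolerance passed to `assertAlmostEqual` matters. The sketch below uses a hypothetical computed score (not an actual ExplainaBoard output) to show that the default `places=7` rejects a value that agrees with 0.0946 only to four decimals, while `places=4` accepts it.

```python
import unittest


class ToleranceSketch(unittest.TestCase):
    # Hypothetical value standing in for a recomputed correlation.
    computed_score = 0.09462

    def test_default_places_is_too_strict(self):
        # The default places=7 requires round(a - b, 7) == 0.
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(self.computed_score, 0.0946)

    def test_places_matches_reported_precision(self):
        # places=4 requires round(a - b, 4) == 0.
        self.assertAlmostEqual(self.computed_score, 0.0946, places=4)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToleranceSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Choosing `places` to match the precision of the published table keeps the test from failing on digits the paper never reported.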

odashi · 2022-12-28T00:32:12Z

integration_tests/meta_eval_nlg_test.py

+        self.assertGreater(len(sys_info.results.analyses), 0)
+        self.assertAlmostEqual(
+            overall_score,
+            0.3157,


This result isn't listed in the paper, but it is stored in their GitHub repo (implementation section): https://github.com/neulab/BARTScore#reproduce

neubig

Thanks for this! Overall, I think this is a great initiative.

However, if the numbers are a little bit different because of the stemmer used for ROUGE, actually they aren't really "replicated" I guess? To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?
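
To make the stemmer concern concrete, here is a toy sketch of unigram-overlap F1 (the idea behind ROUGE-1) computed with and without stemming. The `crude_stem` function is a deliberately simplistic stand-in, not the stemmer used by the paper or by any ROUGE package, and the sentences are made up.

```python
from collections import Counter


def crude_stem(token):
    """Hypothetical stand-in stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def unigram_f1(candidate, reference, stem=None):
    """F1 over clipped unigram matches, optionally after stemming each token."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if stem is not None:
        cand = [stem(t) for t in cand]
        ref = [stem(t) for t in ref]
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the reviewer checked both reports"
candidate = "the reviewer checks a report"

plain = unigram_f1(candidate, reference)
stemmed = unigram_f1(candidate, reference, stem=crude_stem)
```

In this toy case stemming lets "checks"/"checked" and "report"/"reports" match, so `stemmed` is higher than `plain`. Different stemmers merge different word forms, which is one way per-sample scores, and therefore the resulting correlations, can drift slightly between implementations.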

neubig · 2022-12-28T13:05:50Z

integration_tests/meta_eval_nlg_test.py

+class MetaEvalNLGNewsroomTest(unittest.TestCase):
+    """
+    Test the NLG metric on newsroom dataset and replicate the reported results from
+    the paper: https://arxiv.org/pdf/2106.11520.pdf


Could you please add some of the details from the PR into this comment here so they remain for later?

Also, I agree with Yusuke's comments below that it'd be a good idea to specify which results were replicated.

"To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?"

I don't plan to do this: (1) we would need to modify the eaas code, and (2) the original paper uses a package that will become outdated, so there is no need to support it. In any case, there is no plan to do this here.

neubig

Thanks a lot! I was able to confirm the ROUGE number but still wasn't able to find the BARTScore number.

Also, I reviewed this before you clicked the "re-request review" button, so apologies if this was still WIP!

neubig · 2022-12-31T16:59:07Z

integration_tests/meta_eval_nlg_test.py

+class MetaEvalNLGNewsroomTest(unittest.TestCase):
+    """
+    Test the NLG metric on newsroom dataset and replicate the reported results from
+    the paper: https://arxiv.org/pdf/2106.11520.pdf


Yeah, that's fair enough!

neubig · 2022-12-31T17:00:40Z

integration_tests/meta_eval_nlg_test.py

+            .value
+        )
+        self.assertGreater(len(sys_info.results.analyses), 0)
+        # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf


Suggested change

# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf

# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf

# Specifically: ROUGE1: Table 4 -> Column: NeR18 -> COH

neubig · 2022-12-31T17:05:36Z

integration_tests/meta_eval_nlg_test.py

+        self.assertGreater(len(sys_info.results.analyses), 0)
+        self.assertAlmostEqual(
+            overall_score,
+            0.3157,


Thanks! I still didn't see where the number 0.3157 came from though. The Spearman numbers in the table were in the range 0.49.

Could you add a little bit more detail to the comment about where the number could be found?

Sorry, I don't want to be pedantic, but it'd be nice to keep this information around, and if we don't do it now we'll likely forget later.
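
One way to keep this information around is to store each expected value next to a note on where it can be looked up. The sketch below is only an illustration: `ReportedResult`, `REPORTED`, and `describe` are hypothetical names, not part of the PR. The ROUGE1 location is the one given in this thread, and the 0.3157 entry assumes it is the BARTScore number and points only to the repo section mentioned above, since a more precise location is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedResult:
    metric: str
    value: float
    source: str  # where a reader can look the number up


# Hypothetical registry; sources are copied from the review thread.
REPORTED = [
    ReportedResult(
        "rouge1",
        0.0946,
        "https://arxiv.org/pdf/2106.11520.pdf, Table 4 -> Column: NeR18 -> COH",
    ),
    ReportedResult(
        "bartscore",  # assumption: 0.3157 is the BARTScore value
        0.3157,
        "https://github.com/neulab/BARTScore#reproduce",
    ),
]


def describe(result):
    """Build a message that can be reused in comments or assertion failures."""
    return f"{result.metric}: expected {result.value} ({result.source})"
```

Passing `describe(...)` as the `msg` argument of an assertion would surface the provenance whenever a replication check fails.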

add tests for meval to replicate paper results

0817722

pfliu-nlp marked this pull request as ready for review December 27, 2022 22:23

pfliu-nlp requested review from neubig and odashi as code owners December 27, 2022 22:23

odashi reviewed Dec 28, 2022

neubig reviewed Dec 28, 2022

add comments & update files

6fa4dd4

neubig reviewed Dec 31, 2022
