add tests for meval to replicate paper results#605
Conversation
| self.assertGreater(len(sys_info.results.analyses), 0) | ||
| self.assertAlmostEqual( | ||
| overall_score, | ||
| 0.0946, |
There was a problem hiding this comment.
I couldn't find this score in the ROUGE results in the original paper.
There was a problem hiding this comment.
ROUGE1: Table 4 -> Column: NeR18 -> COH
| self.assertGreater(len(sys_info.results.analyses), 0) | ||
| self.assertAlmostEqual( | ||
| overall_score, | ||
| 0.3157, |
There was a problem hiding this comment.
This result didn't be listed in paper, but are stored in their github repo: implementation section: https://github.com/neulab/BARTScore#reproduce
There was a problem hiding this comment.
Thanks! I still didn't see where the number 0.3157 came from though. The Spearman numbers in the table were in the range 0.49.
Could you add a little bit more detail to the comment about where the number could be found?
Sorry, I don't want to be pedantic, but it'd be nice to keep this information around and if we don't do it know we'll likely forget later.
neubig
left a comment
There was a problem hiding this comment.
Thanks for this! Overall, I think this is a great initiative.
However, if the numbers are a little bit different because of the stemmer used for ROUGE, actually they aren't really "replicated" I guess? To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?
| class MetaEvalNLGNewsroomTest(unittest.TestCase): | ||
| """ | ||
| Test the NLG metric on newsroom dataset and replicate the reported results from | ||
| the paper: https://arxiv.org/pdf/2106.11520.pdf |
There was a problem hiding this comment.
Could you please add some of the details from the PR into this comment here so they remain for later?
Also, I agree with Yusuke's comments below that it'd be a good idea to specify which results were replicated.
There was a problem hiding this comment.
"To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?"
I won't plan to do this. (1) We need to modify the eaas code. (2) the original paper use a package that will be outdated. no need to support it. Anyway, no plan to do this here.
neubig
left a comment
There was a problem hiding this comment.
Thanks a lot! I was able to confirm the ROUGE number but still wasn't able to find the BARTScore number.
Also, I reviewed this before you clicked the "re-request review button", so if this was still WIP apologies!
| class MetaEvalNLGNewsroomTest(unittest.TestCase): | ||
| """ | ||
| Test the NLG metric on newsroom dataset and replicate the reported results from | ||
| the paper: https://arxiv.org/pdf/2106.11520.pdf |
| .value | ||
| ) | ||
| self.assertGreater(len(sys_info.results.analyses), 0) | ||
| # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf |
There was a problem hiding this comment.
| # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf | |
| # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf | |
| # Specifically: ROUGE1: Table 4 -> Column: NeR18 -> COH |
| self.assertGreater(len(sys_info.results.analyses), 0) | ||
| self.assertAlmostEqual( | ||
| overall_score, | ||
| 0.3157, |
There was a problem hiding this comment.
Thanks! I still didn't see where the number 0.3157 came from though. The Spearman numbers in the table were in the range 0.49.
Could you add a little bit more detail to the comment about where the number could be found?
Sorry, I don't want to be pedantic, but it'd be nice to keep this information around and if we don't do it know we'll likely forget later.
Overview
This PR adds tests to verify whether our implemented meta-evaluation processor is able to replicate reported results from existing published papers.
Relevant issue: https://github.com/inspired-co/taskboard/issues/180
Details
References