Acceptability judgment task dataset based on sentences written by non-native English speakers
The acceptability judgment task (AJT) is a common method in empirical linguistics for gathering information about the internal grammar of speakers of a language. It is also considered a promising way to evaluate the linguistic knowledge of neural language models. The creators of the Corpus of Linguistic Acceptability (CoLA) consider Boolean judgments sufficient; similarly, some non-English resources cast acceptability as a binary classification task.
The NNS-500 dataset is intended for testing pre-trained neural networks. Its sentences were written by non-native speakers, which matters as the source of the unacceptable sentences, and were labelled by a university English teacher. It contains 350 acceptable and 150 unacceptable sentences, so 70% of the sentences are acceptable (compared with 69.2% in the CoLA out-of-domain set).
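The class balance reported above can be checked with a few lines of Python. This is only a minimal sketch: the 1/0 label encoding and the helper name `class_balance` are assumptions made for illustration, not something taken from the dataset files. The label list is rebuilt from the reported counts rather than read from the CSV.

```python
from collections import Counter

def class_balance(labels):
    """Return (share of acceptable sentences, majority-class baseline accuracy)."""
    counts = Counter(labels)
    total = sum(counts.values())
    acceptable_share = counts[1] / total           # assumes 1 = acceptable
    majority_baseline = max(counts.values()) / total
    return acceptable_share, majority_baseline

# Label counts as reported above: 350 acceptable (1), 150 unacceptable (0).
labels = [1] * 350 + [0] * 150
share, baseline = class_balance(labels)
print(f"acceptable: {share:.1%}, majority baseline: {baseline:.1%}")
# prints "acceptable: 70.0%, majority baseline: 70.0%"
```

A classifier that always answers "acceptable" would therefore already reach 70% accuracy on this set, so raw accuracy should be read against that baseline.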
Dataset: https://github.com/yualeks63/NNS-500/blob/main/NNS-500_dataset.csv
More information: https://github.com/yualeks63/NNS-500/blob/main/NNS-500_dataset_description.pdf
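A possible way to score a binary acceptability classifier against a CSV file of this kind is sketched below. The column names `sentence` and `label`, the 1/0 label encoding, and the two sample rows are hypothetical placeholders, so check them against the actual file and its description before use. `predict` stands for any function that maps a sentence to 0 or 1.

```python
import csv
import io

def load_rows(handle, text_col="sentence", label_col="label"):
    """Read (sentence, label) pairs from an open CSV file; column names are assumed."""
    reader = csv.DictReader(handle)
    return [(row[text_col], int(row[label_col])) for row in reader]

def accuracy(rows, predict):
    """Share of sentences for which predict(sentence) matches the gold label."""
    correct = sum(1 for text, gold in rows if predict(text) == gold)
    return correct / len(rows)

# Invented two-row sample standing in for the real file.
sample = io.StringIO(
    "sentence,label\n"
    "She has lived here for five years.,1\n"
    "He go to school yesterday.,0\n"
)
rows = load_rows(sample)

def always_acceptable(text):
    """Trivial stand-in for a model: labels every sentence acceptable."""
    return 1

print(accuracy(rows, always_acceptable))  # prints 0.5
```

To run this on the real data, replace the `io.StringIO` sample with an opened local copy of the CSV and pass the file's real column names to `load_rows`.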