Code and data in support of "Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus," presented at NLP4DH 2024 at EMNLP and published in the ACL Anthology.
The corpus (found in data/) consists of 4032 orthovariant tokens and their contexts, drawn from a version of the Project Gutenberg corpus restricted to U.S. literary works published from the early 19th to the early 20th century.
Messner provided two data annotations for each sample:
- The modern "standard" version of each orthovariant token
- A dialect tag (dtag) indicating the subject position, as intended by the author, of the utterer of the token.

For more details on the annotation process, see the paper.

To use this code:
- Install dependencies using requirements.txt and pip
- Embed the dataset
- Run the experiments using scons -Q
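The install and run steps above can be sketched as the following commands. This is a minimal sketch that assumes it is run from the repository root, that pip belongs to the Python environment you intend to use, and that scons is available on your PATH after installation (whether requirements.txt provides it is an assumption, not something stated here).

```shell
# Install the Python dependencies listed in requirements.txt
pip install -r requirements.txt

# Run the experiments; -Q suppresses SCons's "Reading SConscript files..." status messages
scons -Q
```

If you prefer to keep the dependencies isolated, creating and activating a virtual environment before the pip step is a common choice, though nothing here requires it.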