Experiments Clustering Literary Variant Orthography

Code and data in support of "Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus," presented at NLP4DH 2024 at EMNLP and published in the ACL Anthology.

Data Details

The corpus (found in data/) consists of 4032 orthovariant tokens and their context, drawn from a version of the Project Gutenberg corpus restricted to U.S. literary works published between the early 19th and early 20th centuries.

Messner provided two data annotations for each sample:

The modern "standard" version of each orthovariant token
A dialect tag (dtag) indicating the subject position, as intended by the author, of the utterer of the token

For more details on the annotation process, see the paper.
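To make the annotation scheme concrete, here is a minimal sketch of how one sample and its two annotations might be represented in Python. The class name, the field names, the toy records, and the tag values (TAG_A, TAG_B) are all assumptions made for illustration; they are not taken from the files in data/, whose actual layout may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One annotated sample. All field names here are hypothetical."""

    token: str     # orthovariant token as it appears in the text
    context: str   # text surrounding the token
    standard: str  # annotation 1: modern "standard" version of the token
    dtag: str      # annotation 2: dialect tag for the utterer of the token


# Toy records invented for illustration; they are not drawn from the corpus,
# and TAG_A / TAG_B are placeholder tag values.
samples = [
    Sample("gwine", "I'm gwine home", "going", "TAG_A"),
    Sample("sartin", "that's sartin sure", "certain", "TAG_B"),
    Sample("dat", "what's dat", "that", "TAG_A"),
]

# Count how many samples carry each dialect tag.
dtag_counts = Counter(s.dtag for s in samples)
print(dtag_counts)
```

Keeping the original token next to its standard form and its dtag in a single record makes it straightforward to group or filter samples by dialect tag before embedding or clustering them.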

Running the Code

Embed the dataset using this code
Install dependencies with pip: pip install -r requirements.txt
Run the experiments using scons -Q.
