Dataset and replication package for Histories of England text reuse article.
Access the published article here: A Computational Approach to Literary Borrowing in Enlightenment Britain
CC BY-NC. https://creativecommons.org/licenses/by-nc/4.0/
If using the data, cite the published article:
Vaara, Ville. "Chapter 2 Charting the Circulation of Histories of England: A Computational Approach to Literary Borrowing in Enlightenment Britain." In Enlightenment Histories, edited by Marc Hanvelt, Mark Gregory Spencer and Mikko Sakari Tolonen, 37-72. De Gruyter Oldenbourg, 2026.
https://doi.org/10.1515/9783111637181-003
Each article_* script is independent and produces part of the article's data and plots.
article0_datasets.py loads common source datasets and is used by the other scripts.
config_visuals.py sets the visual style of the plots.
Note that cloning all the data requires Git LFS.
The datasets are found in data/raw.
hoe_meta.csv - Histories of England titles metadata.
- 'manifestation_id': Unique id for each physical book.
- 'estc_id': ID of title in English Short Title Catalogue.
- 'actor_id': Unique ID for each author.
- 'name_unified': Name of author.
- 'publication_year': Publication year of the volume / edition.
- 'title': Title of the volume / edition.
- 'text_length': Length of text, in characters.
- 'work_id': Unique ID of the work.
- 'main_category': Genre category of the work. All are Histories of England.
- 'text_length_p': Length of text, in pages. (characters / 3000)
- 'sequence': First or subsequent edition.
- 'author_volume_group': Unique ID for the volume group. Grouping volumes across editions, e.g. First Volume of Hume's History in each edition, etc.
- 'publication_decade': Publication decade of the volume / edition.
- 'originality': Originality ratio of the volume.
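As a minimal sketch of how the two length columns relate, the snippet below parses a small in-memory sample shaped like hoe_meta.csv with Python's standard csv module and recomputes the page estimate from the character count. The sample rows and the helper name `pages_from_chars` are invented for illustration; only the column names and the 3000 characters-per-page divisor come from the descriptions above.

```python
import csv
import io

CHARS_PER_PAGE = 3000  # divisor stated in the 'text_length_p' description

# Invented sample rows; the real file is data/raw/hoe_meta.csv.
SAMPLE = """manifestation_id,publication_year,text_length
m1,1754,900000
m2,1762,1500000
"""


def pages_from_chars(text_length):
    """Estimate a page count from a character count."""
    return text_length / CHARS_PER_PAGE


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    # csv yields strings, so convert before dividing.
    row["text_length_p"] = pages_from_chars(int(row["text_length"]))
    print(row["manifestation_id"], row["text_length_p"])
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an open file handle for data/raw/hoe_meta.csv.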
reception_inception_coverage.csv - Reuse numbers between titles.
- 'src_manifestation_id': Manifestation ID of the source document.
- 'dst_manifestation_id': Manifestation ID of the destination document.
- 'coverage_src_in_dst_abs': Absolute number of characters reused.
- 'same_author': Whether both manifestations have the same author.
coverage_full.csv - Pairwise coverage between titles.
- 'mi1': Manifestation ID of title 1.
- 'mi2': Manifestation ID of title 2.
- 'reuse_t1_t2': Number of characters in title 1 reused in title 2.
- 'reuse_t2_t1': Number of characters in title 2 reused in title 1.
- 'coverage_t1_t2': Proportion of title 1 reused in title 2.
- 'coverage_t2_t1': Proportion of title 2 reused in title 1.
- 't1_length': Length of title 1.
- 't2_length': Length of title 2.
- 'coverage_max': Maximum of 'coverage_t1_t2' and 'coverage_t2_t1'.
- 'reuse_max': Maximum of 'reuse_t1_t2' and 'reuse_t2_t1'.
- 'publication_year1': Publication year of title 1.
- 'publication_year2': Publication year of title 2.
- 'same_author': Whether both titles have the same author.
- 'coverage_directed': Coverage of the earlier publication in the later one.
- 'both_authors_present': Whether both titles have authors in the metadata.
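The derived columns of coverage_full.csv can be illustrated with a short sketch. It assumes, without confirmation from the replication scripts, that each coverage value is the reused character count divided by the length of the reused title, and that a tie in publication year treats title 1 as the earlier one. The function name `derive_coverage` and the example values are hypothetical.

```python
def derive_coverage(row):
    """Recompute the derived coverage columns for one title pair.

    Assumption: coverage = reused characters / length of the reused title.
    Assumption: on equal publication years, title 1 counts as the earlier one.
    """
    coverage_t1_t2 = row["reuse_t1_t2"] / row["t1_length"]
    coverage_t2_t1 = row["reuse_t2_t1"] / row["t2_length"]
    t1_is_earlier = row["publication_year1"] <= row["publication_year2"]
    return {
        "coverage_t1_t2": coverage_t1_t2,
        "coverage_t2_t1": coverage_t2_t1,
        "coverage_max": max(coverage_t1_t2, coverage_t2_t1),
        "reuse_max": max(row["reuse_t1_t2"], row["reuse_t2_t1"]),
        # Coverage of the earlier publication in the later one.
        "coverage_directed": coverage_t1_t2 if t1_is_earlier else coverage_t2_t1,
    }


# Invented example values, not taken from the dataset.
example = {
    "reuse_t1_t2": 2000,
    "reuse_t2_t1": 1500,
    "t1_length": 10000,
    "t2_length": 5000,
    "publication_year1": 1754,
    "publication_year2": 1770,
}
derived = derive_coverage(example)
print(derived)
```

Comparing such recomputed values with the columns in the file is a quick way to check whether these assumptions match the published data.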
as_original_source_maps/dst/*.csv - Text reuses originating from each title. Filename is Manifestation ID of origin.
- 'dst_trs_start': Character index of beginning of reuse in destination.
- 'dst_trs_end': Character index of end of reuse in destination.
- 'dst_mi': Manifestation ID of the destination.
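A minimal sketch of summarising one of these files: it totals the reused characters per destination by summing span lengths. The sample rows are invented, the helper name `reused_chars_by_destination` is hypothetical, and the end index is assumed to be exclusive, which the description above does not state. Overlapping spans, if any exist, would be counted more than once.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the shape of one as_original_source_maps/dst/*.csv file.
SAMPLE = """dst_trs_start,dst_trs_end,dst_mi
0,120,m2
300,450,m2
10,60,m3
"""


def reused_chars_by_destination(csv_file):
    """Sum reuse span lengths per destination manifestation ID."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        # Assumption: dst_trs_end is exclusive, so length = end - start.
        totals[row["dst_mi"]] += int(row["dst_trs_end"]) - int(row["dst_trs_start"])
    return dict(totals)


totals = reused_chars_by_destination(io.StringIO(SAMPLE))
print(totals)
```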
as_original_source_maps/src/*.csv - Text reuses originating from each title. Filename is Manifestation ID of origin.
- 'dst_trs_start': Character index of beginning of reuse in source.
- 'dst_trs_end': Character index of end of reuse in source.
- 'dst_mi': Manifestation ID of the destination.
originality_maps/*.json - Original segments of each title. Filename denotes Manifestation ID.
- "manifestation_id": Manifestation ID of the manifestation.
- "originality_ratio": 0-1. Ratio of original content to all content.
- "originality_segments": Character indices of original segments.
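The relation between the ratio and the segments can be sketched as follows. The snippet assumes that "originality_segments" is a list of [start, end] character index pairs with an exclusive end; the actual JSON layout may differ, so check a real file before relying on this. The sample record and the helper name `original_chars` are invented for illustration.

```python
import json

# Invented record; the segment layout ([start, end] pairs) is an assumption.
SAMPLE = """
{"manifestation_id": "m1",
 "originality_ratio": 0.25,
 "originality_segments": [[0, 100], [300, 450]]}
"""


def original_chars(record):
    """Total number of characters covered by the original segments."""
    return sum(end - start for start, end in record["originality_segments"])


record = json.loads(SAMPLE)
n_original = original_chars(record)
# If the ratio is original characters / all characters, the text length follows.
implied_length = n_original / record["originality_ratio"]
print(n_original, implied_length)
```

For a real manifestation, the implied length can be compared with 'text_length' in hoe_meta.csv as a consistency check.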