Clean overlapping annotation spans and fix evaluation#3

Open
monradach wants to merge 2 commits into CLU-UML:main from monradach:improve/clean-data-and-eval
@monradach

  • The original MedDec annotations contain many overlapping and nested spans: 305 files have overlaps, with 2,332 overlapping pairs in total (see 98994_178949_30340_original.pdf).
  • clean_data.py now sorts annotations by start_offset before saving.
  • Add cleanup logic for same_class_overlap, same_class_nested, and multi_class_nested spans. After cleanup, 385 overlapping pairs remain (see 98994_178949_30340_updated.pdf).
  • For the remaining overlaps, evaluate.py enforces a single label per token, as the original script did: if multiple spans cover the same token, the first span encountered takes priority.
  • Fix evaluate.py: improve file ID matching in splits and fix the handling of '_' in load_gold_annotations.
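
The overlap categories and the first-span-wins rule described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Span` tuple layout, the helper names `classify_overlap` and `label_tokens`, the exclusive end offsets, and the `multi_class_overlap` category are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical span layout: (start_offset, end_offset_exclusive, label).
Span = Tuple[int, int, str]


def classify_overlap(a: Span, b: Span) -> Optional[str]:
    """Name the overlap category for two spans, or return None if they are disjoint."""
    # Order the pair so the span that starts first (and is longer on ties) comes first.
    (s1, e1, l1), (s2, e2, l2) = sorted([a, b], key=lambda s: (s[0], -s[1]))
    if s2 >= e1:
        return None  # the second span starts at or after the end of the first
    nested = e2 <= e1  # the second span lies entirely inside the first
    if l1 == l2:
        return "same_class_nested" if nested else "same_class_overlap"
    # "multi_class_overlap" is an assumed name; the PR only lists the other three.
    return "multi_class_nested" if nested else "multi_class_overlap"


def label_tokens(
    tokens: List[Tuple[int, int]], spans: List[Span], outside: str = "O"
) -> List[str]:
    """Assign one label per token; the first span (by start_offset) covering a token wins."""
    labels = [outside] * len(tokens)
    # sorted() is stable, so spans with equal start offsets keep their input order.
    for start, end, label in sorted(spans, key=lambda s: s[0]):
        for i, (tok_start, tok_end) in enumerate(tokens):
            covers = tok_start < end and tok_end > start
            if covers and labels[i] == outside:
                labels[i] = label
    return labels


if __name__ == "__main__":
    spans = [(0, 10, "A"), (5, 15, "B")]
    tokens = [(0, 4), (5, 9), (10, 14)]
    print(classify_overlap(*spans))
    print(label_tokens(tokens, spans))
```

In this sketch the middle token is covered by both spans and keeps the label of the span that starts first, which mirrors the priority rule stated for evaluate.py.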
