Anomalous entity entries in CSV output #33

@EmanuelFaria

ISSUE: While using scispaCy to create the CSV, I noticed a number of problems with the entities it found and the labels attached to them.

NOTE: I suspect at least some of the problem is due to the output being comma-delimited, so I propose we try tab-delimited output. I'll re-run this test corpus and compare.
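
To make the delimiter hypothesis easier to test, here is a minimal sketch using Python's standard `csv` module. It does not reflect how the project currently writes its output; the `entity`/`label` column layout, the sample rows, and the helper names `write_rows` and `read_rows` are assumptions for illustration only. It shows that an entity containing a comma survives a round trip when the writer quotes fields, that a naive split on commas breaks the row, and how the same rows could be written tab-delimited.

```python
import csv
import io

# Hypothetical rows (entity text, label); not taken from the real output.
rows = [
    ("fatty acid, saturated", "DISEASE"),
    ("TNF-alpha", "DISEASE"),
]


def write_rows(rows, delimiter):
    """Serialize rows with csv.writer so embedded delimiters get quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["entity", "label"])
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text, delimiter):
    """Parse text back into a list of rows using the same delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_text = write_rows(rows, ",")
tab_text = write_rows(rows, "\t")

# A naive split on commas cuts the first entity in two.
naive_fields = comma_text.splitlines()[1].split(",")
```

If the comma-delimited file is already written and read with a CSV-aware library, switching to tabs alone may not change the anomalies, so comparing both outputs on the same corpus seems like the right test.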

In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)

In this case I focused on which entities were mislabeled as DISEASE.

The types of errors include the following being identified as DISEASE:

  • email addresses
  • author names (or parts thereof)
  • apostrophes (in many cases a lone apostrophe was labeled DISEASE; maybe before exporting we add a step to replace all smart quotes and apostrophes with dumb ones?)
  • plant names and extracts
  • organization names (or parts thereof)
  • chemical compounds
  • names of proteins
  • measurements (or parts thereof)
  • fatty acid is treated as a disease throughout
  • microbes are treated as a disease throughout, but I suspect that is intentional
  • factors such as TNF-alpha
  • chemical terms such as dissolution/solubility
  • Moroccan cultural heritage
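
The smart-quote replacement suggested in the apostrophe item above could look roughly like the sketch below. It is only an illustration: the function name `normalize_quotes` is hypothetical, the mapping covers just the four most common curly quote characters, and whether the step belongs before entity extraction or only before export is an open question.

```python
# Map typographic (smart) quotes and apostrophes to plain ASCII ones.
# Only the four most common code points are covered; this is an assumption.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with smart quotes replaced by their ASCII equivalents."""
    return text.translate(SMART_QUOTES)
```

For example, `normalize_quotes("Crohn\u2019s disease")` returns the string with a plain apostrophe. This would not by itself explain why a lone apostrophe gets a DISEASE label, but it would remove one variable when re-running the corpus.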

Also, I noticed some entities mislabeled as CHEMICAL:

  • COVID-19
  • random numbers

Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.
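
To put a number on the unexpanded abbreviations, a small script along these lines could list them from the exported CSV. This is a sketch under stated assumptions: only the `abbreviations_longform` column name comes from this issue, while the `entity` column name, the helper `unexpanded_abbreviations`, the regex heuristic for what counts as an abbreviation, and the sample rows are all made up for illustration.

```python
import csv
import io
import re

# Rough heuristic (an assumption): an uppercase letter followed by 1-9
# uppercase letters, digits, or hyphens, e.g. "TNF" or "COVID-19".
ABBREV_RE = re.compile(r"[A-Z][A-Z0-9-]{1,9}")


def unexpanded_abbreviations(csv_text, entity_col="entity",
                             longform_col="abbreviations_longform"):
    """Return abbreviation-like entities whose long-form cell is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = []
    for row in reader:
        entity = (row.get(entity_col) or "").strip()
        longform = (row.get(longform_col) or "").strip()
        if ABBREV_RE.fullmatch(entity) and not longform:
            missing.append(entity)
    return missing


# Made-up sample rows, not real output from the tool.
sample = (
    "entity,label,abbreviations_longform\n"
    "TNF,DISEASE,\n"
    "COVID-19,CHEMICAL,coronavirus disease 2019\n"
    "fatty acid,DISEASE,\n"
)
missing = unexpanded_abbreviations(sample)
```

On the made-up sample, only the row whose entity looks like an abbreviation and has an empty long-form cell is reported.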