Anomalous entity entries in CSV output #33

ISSUE: While using scispaCy to create the CSV, I noticed a number of problems with the `entities` it found and the `labels` attached to them.

**NOTE: I suspect at least some of the problem could be due to the output being comma-delimited, so I propose we try with tab-delimited output, and I'll re-run this test corpus and compare.**
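
To make the delimiter hypothesis concrete, here is a minimal sketch using only Python's standard `csv` module. The rows and the `write_rows`/`read_rows` helpers are hypothetical, not taken from our export code. It shows how an entity containing a comma survives a tab-delimited round trip, while a naive split on commas breaks it into extra columns. Whether this is what actually happens in our output still needs to be confirmed by the re-run.

```python
import csv
import io

# Hypothetical rows for illustration only: (entity, label).
# The second entity contains a comma on purpose.
rows = [
    ("TNF-alpha", "DISEASE"),
    ("fatty acid, saturated", "DISEASE"),
]

def write_rows(rows, delimiter):
    """Serialize rows to text; the csv module quotes fields only when needed."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerow(["entities", "labels"])
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text, delimiter):
    """Parse text back into a list of rows with the same delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

tsv_text = write_rows(rows, "\t")
csv_text = write_rows(rows, ",")

# Both formats round-trip when read with a real CSV parser...
tsv_rows = read_rows(tsv_text, "\t")
csv_rows = read_rows(csv_text, ",")

# ...but a naive split on commas yields three pieces instead of two columns.
naive_split = csv_text.splitlines()[2].split(",")
```

If the current export or a downstream reader splits on commas without honoring quotes, the third column count seen in `naive_split` is the kind of shift that could put text under the wrong label.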

In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)

**In this case I was focusing on which `entities` were mis-labeled as `DISEASE`.**

**The types of errors include the following being identified as DISEASE:**
- email addresses
- author names (or parts thereof)
- apostrophes (in many cases a lone apostrophe was labelled a DISEASE. Maybe before exporting we include a step to replace all smart quotes and apostrophes with dumb ones?)
- plant names and extracts
- organization names (or parts thereof)
- chemical compounds
- names of proteins
- measurements (or parts thereof)
- fatty acid is treated as a disease throughout
- microbes are treated as a disease throughout, but I suspect that is intentional
- factors such as TNF-alpha
- chemical terms such as dissolution/solubility
- Moroccan cultural heritage
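
The smart-quote replacement suggested in the apostrophe item could be sketched roughly as follows. This is a standard-library-only illustration; `SMART_QUOTES` and `normalize_quotes` are hypothetical names, and the set of characters covered is an assumption about what appears in the corpus. Where in the pipeline it should run (before entity extraction or just before export) is still open.

```python
# Map typographic ("smart") quotes and apostrophes to their ASCII equivalents.
# The character set below is an assumption; extend it if the corpus has others.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
    "\u201A": "'",  # single low-9 quotation mark
    "\u201B": "'",  # single high-reversed-9 quotation mark
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
    "\u201E": '"',  # double low-9 quotation mark
}

# Build the translation table once so it can be reused for every document.
_QUOTE_TABLE = str.maketrans(SMART_QUOTES)

def normalize_quotes(text: str) -> str:
    """Return text with smart quotes and apostrophes replaced by ASCII ones."""
    return text.translate(_QUOTE_TABLE)

cleaned = normalize_quotes("Crohn\u2019s disease and \u201cTNF-alpha\u201d")
```

This only changes the characters; it does not by itself guarantee that a lone apostrophe stops being tagged, so the test corpus would need to be re-run to see the effect.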


**I also noticed some entities mis-labeled as CHEMICAL:**
- COVID-19
- random numbers

**Also, many abbreviations came up as entities but were not expanded in the `abbreviations_longform` column.**
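
To get a quick count of the unexpanded abbreviations, something like the sketch below could be run over the output. It assumes the file has `entities` and `abbreviations_longform` columns and is tab-delimited. The sample rows are made up for illustration, and `looks_like_abbreviation` is a crude, hypothetical heuristic (short token with at least two capitals), not how scispaCy detects abbreviations.

```python
import csv
import io
import re

# Made-up sample in place of the real output file; column names are assumed.
SAMPLE = (
    "entities\tlabels\tabbreviations_longform\n"
    "TNF\tDISEASE\t\n"
    "IBD\tDISEASE\tinflammatory bowel disease\n"
    "fatty acid\tDISEASE\t\n"
)

# Single token of 2-10 characters starting with a capital letter.
ABBREV_RE = re.compile(r"^[A-Z][A-Za-z0-9-]{1,9}$")

def looks_like_abbreviation(entity: str) -> bool:
    """Crude heuristic: one short token containing at least two capitals."""
    if not ABBREV_RE.match(entity):
        return False
    return sum(ch.isupper() for ch in entity) >= 2

def unexpanded_abbreviations(rows):
    """Return entities that look like abbreviations but have no long form."""
    flagged = []
    for row in rows:
        entity = (row.get("entities") or "").strip()
        longform = (row.get("abbreviations_longform") or "").strip()
        if looks_like_abbreviation(entity) and not longform:
            flagged.append(entity)
    return flagged

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
flagged = unexpanded_abbreviations(reader)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and the flagged list could be added to the Anomaly column for review.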
