Description
ISSUE: Using scispaCy to create a CSV, I noticed a number of problems with the entities it found and the labels attached to them.
NOTE: I suspect at least some of the problem could be due to the output being comma-delimited, so I propose we try with tab-delimited output, and I'll re-run this test corpus and compare.
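For reference, a minimal sketch of what tab-delimited output could look like with Python's standard csv module. The column names and the sample row are made up for illustration and are not taken from the actual export code.

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write dict rows as tab-delimited text.

    With a tab delimiter, commas inside a field are ordinary characters,
    so the writer does not need to quote them.
    """
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical columns and data, for illustration only.
rows = [{"entity": "oleic acid, linoleic acid", "label": "CHEMICAL"}]
buffer = io.StringIO()
write_tsv(rows, ["entity", "label"], buffer)
tsv_text = buffer.getvalue()
```

Reading the file back with `csv.DictReader(handle, delimiter="\t")` should give the same fields, which would make it easier to tell whether the comma delimiter was contributing to the bad rows.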
In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)
In this case I was focusing on which entities were mis-labeled as DISEASE.
The types of errors include the following being identified as DISEASE:
- email address
- author names (or parts thereof)
- apostrophes (in many cases a lone apostrophe was labelled a DISEASE. Maybe before exporting we include a step to replace all smart quotes and apostrophes with dumb ones?)
- plant names and extracts
- organization names (or parts thereof)
- chemical compounds
- names of proteins
- measurements (or parts thereof)
- fatty acid is treated as a disease throughout
- microbes are treated as a disease throughout, but I suspect that is intentional
- factors such as TNF-alpha
- chemical terms such as dissolution/solubility
- Moroccan cultural heritage
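The quote-replacement step suggested in the apostrophe item could be sketched roughly as follows. This is only an assumption about how such a step might look. The function name is hypothetical, and whether it actually stops lone apostrophes from being labelled DISEASE would need to be checked against the test corpus.

```python
# Map common "smart" quote characters to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark / typographic apostrophe
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
    }
)


def normalize_quotes(text: str) -> str:
    """Return text with smart quotes and apostrophes replaced by plain ones."""
    return text.translate(SMART_QUOTES)
```

For example, `normalize_quotes("Crohn\u2019s")` returns `"Crohn's"` with a plain ASCII apostrophe.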
I also noticed some entities mis-labeled as CHEMICAL:
- COVID-19
- random numbers
Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.
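To get a count of these, something like the sketch below could flag rows whose entity looks like an abbreviation but has an empty abbreviations_longform value. The `entity` column name and the abbreviation heuristic (a short token with at least two capital letters) are assumptions for illustration, not part of the current export.

```python
import re

# Heuristic, not a definition: 2 to 10 characters, starting with a letter,
# made of letters, digits, or hyphens, and containing at least two capitals.
ABBREV_PATTERN = re.compile(r"^(?=(?:.*[A-Z]){2})[A-Za-z][A-Za-z0-9-]{1,9}$")


def unexpanded_abbreviations(
    rows, entity_col="entity", longform_col="abbreviations_longform"
):
    """Return entities that look like abbreviations but have no long form."""
    flagged = []
    for row in rows:
        entity = (row.get(entity_col) or "").strip()
        longform = (row.get(longform_col) or "").strip()
        if ABBREV_PATTERN.match(entity) and not longform:
            flagged.append(entity)
    return flagged
```

The rows can come straight from `csv.DictReader` on the exported file, so this could be run on the same test corpus before and after any change.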