Coverage tests
Lenz edited this page Jun 7, 2018 · 7 revisions
- Terminology: 2012 snapshot of MEDIC (taken from DNorm); all names and synonyms are converted to lower case and mapped to all IDs (preferred + alternative)
- Corpus: NCBI-disease training set
- index the terminology names with simstring
- use cosine similarity with varying thresholds to retrieve similar-looking terminology entries for each mention in the corpus (the candidate set fed into the CNN)
- if at least one of the candidates links to an ID found in the reference (compound concepts: a link for each component), count this as a match
- compute the proportion of matches (coverage) and the mean size (and std.dev.) of the candidate sets for every threshold value
- compute separate scores with duplicates removed (distinct mentions)
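The threshold-based retrieval step can be illustrated with a brute-force sketch. This is not the simstring library itself: it assumes character trigrams as features and a set-based cosine measure, and it scans the whole terminology instead of using an index. The function names and the toy terminology are hypothetical.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def cosine(a, b):
    """Set-based cosine similarity: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def retrieve(mention, names, threshold):
    """Return all terminology names whose similarity reaches the threshold."""
    query = char_ngrams(mention)
    return [name for name in names
            if cosine(query, char_ngrams(name)) >= threshold]


# Hypothetical toy terminology, for illustration only.
NAMES = ["breast cancer", "breast carcinoma", "colon cancer"]
hits_strict = retrieve("breast cancers", NAMES, 0.7)
hits_loose = retrieve("breast cancers", NAMES, 0.3)
```

Lowering the threshold lets more names through, which is the trade-off between coverage and candidate-set size that the evaluation measures.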
| Threshold | Coverage | Candidates | Coverage (distinct) | Candidates (distinct) |
|---|---|---|---|---|
| .7 | 0.739 | 9.8 (26.8) | 0.546 | 7.25 (19.9) |
| .6 | 0.782 | 27.0 (63.8) | 0.643 | 23.1 (54.1) |
| .5 | 0.834 | 86.1 (194) | 0.733 | 79.5 (169) |
| .4 | 0.859 | 270 (554) | 0.796 | 253 (480) |
| .3 | 0.885 | 696 (1230) | 0.853 | 705 (1080) |
| .2 | 0.904 | 1290 (1800) | 0.887 | 1510 (1670) |
| .1 | 0.922 | 2940 (2740) | 0.921 | 3970 (2570) |
Number of mentions: 5145
Number of distinct mentions: 1596
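The figures in the table above could be computed along the following lines. This is a minimal sketch under several assumptions: a compound concept is read as requiring a hit for every component, the standard deviation is the population variant, and duplicates are identified by mention text alone. All names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def is_match(candidate_names, components, ids_for_name):
    """True if every reference component is hit by at least one candidate."""
    linked = set()
    for name in candidate_names:
        linked |= ids_for_name.get(name, set())
    return all(linked & component for component in components)


def coverage_stats(mentions, candidates_for, ids_for_name):
    """Return (coverage, mean candidate-set size, std. dev. of the size)."""
    hits, sizes = 0, []
    for text, components in mentions:
        candidates = candidates_for(text)
        sizes.append(len(candidates))
        hits += is_match(candidates, components, ids_for_name)
    return hits / len(mentions), mean(sizes), pstdev(sizes)


def distinct(mentions):
    """Keep the first occurrence of each mention text (assumed dedup key)."""
    seen = {}
    for text, components in mentions:
        seen.setdefault(text, (text, components))
    return list(seen.values())


# Hypothetical toy data: made-up IDs and pre-computed candidate sets.
IDS = {"breast cancer": {"X1"}, "colon cancer": {"X2"}}
RETRIEVED = {
    "breast cancer": ["breast cancer"],
    "bowel tumour": [],
    "breast and colon cancer": ["breast cancer"],
}
MENTIONS = [
    ("breast cancer", [{"X1"}]),
    ("breast cancer", [{"X1"}]),
    ("bowel tumour", [{"X2"}]),
    ("breast and colon cancer", [{"X1"}, {"X2"}]),  # compound concept
]
all_stats = coverage_stats(MENTIONS, RETRIEVED.get, IDS)
distinct_stats = coverage_stats(distinct(MENTIONS), RETRIEVED.get, IDS)
```

In the toy data, the repeated easy mention raises coverage over all mentions relative to coverage over distinct mentions, which is the same direction of difference as between the two coverage columns in the table.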
Same setup as for cosine similarity, but with a different candidate retrieval mechanism in place of simstring:
- dictionary entries are indexed by their character skip-grams
- candidates are generated by picking the top-x entries with the highest skip-gram overlap with a given mention
- ties (multiple candidates with the same overlap) are resolved by preferring shorter candidates
- seven different values for x (number of candidates per mention) are evaluated
- two different skip-gram configurations are evaluated:
  - 1-skip-bigrams and 1-skip-trigrams (i.e. regular bi-/trigrams plus bi-/trigrams with a one-character gap)
  - 1-skip-bigrams and 2-skip-trigrams (the same, but two gaps are allowed for trigrams)
- separate figures are computed with duplicates removed (distinct mentions)
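The skip-gram retrieval described in the list above might look roughly like this. The sketch assumes that "k-skip" means up to k skipped characters in total per n-gram, that overlap is the raw count of shared skip-grams, and that remaining ties are broken alphabetically; the page does not spell out any of these. Function names and the toy dictionary are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def skipgrams(text, n, k):
    """Character n-grams allowing up to k skipped characters in total."""
    grams = set()
    for start in range(len(text)):
        window = text[start:start + n + k]
        # Keep the first character, choose the other n-1 from the rest.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add(window[0] + "".join(window[i] for i in rest))
    return grams


def features(text, tri_skip=1):
    """1-skip-bigrams plus tri_skip-skip-trigrams of a lower-cased string."""
    text = text.lower()
    return skipgrams(text, 2, 1) | skipgrams(text, 3, tri_skip)


def build_index(names, tri_skip=1):
    """Inverted index from skip-gram to the dictionary entries containing it."""
    index = defaultdict(set)
    for name in names:
        for gram in features(name, tri_skip):
            index[gram].add(name)
    return index


def top_candidates(mention, index, x, tri_skip=1):
    """Top-x entries by overlap; ties go to the shorter entry."""
    overlap = Counter()
    for gram in features(mention, tri_skip):
        overlap.update(index.get(gram, ()))
    ranked = sorted(overlap, key=lambda name: (-overlap[name], len(name), name))
    return ranked[:x]


# Hypothetical toy dictionary, for illustration only.
INDEX = build_index(["anemia", "anaemia", "anemias"])
best_two = top_candidates("anemia", INDEX, 2)
```

Here "anemia" and "anemias" share the same number of skip-grams with the mention, so the shorter entry is ranked first.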
All fractional numbers denote coverage.
| Candidates | 2,1- and 3,1-grams | 2,1- and 3,2-grams | 2,1- and 3,1-grams (distinct) | 2,1- and 3,2-grams (distinct) |
|---|---|---|---|---|
| 2 | 0.726 | 0.731 | 0.515 | 0.518 |
| 4 | 0.751 | 0.750 | 0.550 | 0.550 |
| 8 | 0.768 | 0.764 | 0.587 | 0.585 |
| 16 | 0.790 | 0.792 | 0.626 | 0.628 |
| 32 | 0.807 | 0.807 | 0.665 | 0.666 |
| 64 | 0.825 | 0.824 | 0.702 | 0.702 |
| 128 | 0.842 | 0.842 | 0.737 | 0.739 |
Same data as above.
- pretrained word embeddings provided by Haodi Li et al. 2017.
- phrase vectors are obtained by summing or averaging over the word vectors of all words in a mention/dictionary entry
- candidates are generated with gensim's `KeyedVectors.most_similar` method
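The phrase-vector construction and lookup can be sketched without any dependencies. The code below is a pure-Python stand-in for the nearest-neighbour search, not the gensim call used for the experiments; it assumes cosine similarity as the ranking measure and silently skips words without a vector. The toy vectors and all function names are hypothetical.

```python
import math


def phrase_vector(phrase, word_vectors, strategy="mean"):
    """Combine the vectors of all known words in a phrase by sum or mean."""
    vectors = [word_vectors[w] for w in phrase.lower().split()
               if w in word_vectors]
    if not vectors:
        return None  # no word of the phrase has an embedding
    total = [sum(dims) for dims in zip(*vectors)]
    if strategy == "sum":
        return total
    return [value / len(vectors) for value in total]


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def most_similar(mention, names, word_vectors, topn, strategy="mean"):
    """Return the topn dictionary entries closest to the mention."""
    query = phrase_vector(mention, word_vectors, strategy)
    if query is None:
        return []
    scored = []
    for name in names:
        vector = phrase_vector(name, word_vectors, strategy)
        if vector is not None:
            scored.append((cosine(query, vector), name))
    scored.sort(key=lambda pair: -pair[0])
    return [name for _, name in scored[:topn]]


# Hypothetical two-dimensional toy embeddings, for illustration only.
WORDS = {
    "breast": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "tumor": [0.1, 0.9],
    "colon": [-1.0, 0.0],
}
NAMES = ["breast cancer", "colon cancer"]
nearest = most_similar("breast tumor", NAMES, WORDS, topn=1)
```

With real embeddings, the dictionary-entry vectors would typically be computed once and stored in an index rather than rebuilt for every mention as done here.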
All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining word vectors into phrase vectors.
| Candidates | sum | mean | sum (distinct) | mean (distinct) |
|---|---|---|---|---|
| 2 | 0.375 | 0.396 | 0.345 | 0.346 |
| 4 | 0.407 | 0.436 | 0.392 | 0.398 |
| 8 | 0.485 | 0.512 | 0.447 | 0.450 |
| 16 | 0.537 | 0.546 | 0.499 | 0.502 |
| 32 | 0.589 | 0.591 | 0.555 | 0.557 |
| 64 | 0.637 | 0.639 | 0.614 | 0.616 |
| 128 | 0.669 | 0.667 | 0.659 | 0.660 |
| 256 | 0.713 | 0.715 | 0.708 | 0.713 |
| 512 | 0.772 | 0.774 | 0.764 | 0.766 |
| 1024 | 0.817 | 0.820 | 0.820 | 0.826 |
Same as the previous, but the embeddings are trained on subword units of PubMed abstracts instead of whole words.
All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining subword-unit vectors into phrase vectors.
| Candidates | sum | mean | sum (distinct) | mean (distinct) |
|---|---|---|---|---|
| 2 | 0.122 | 0.122 | 0.112 | 0.112 |
| 4 | 0.144 | 0.144 | 0.140 | 0.141 |
| 8 | 0.192 | 0.192 | 0.179 | 0.179 |
| 16 | 0.221 | 0.221 | 0.218 | 0.218 |
| 32 | 0.258 | 0.259 | 0.267 | 0.268 |
| 64 | 0.292 | 0.292 | 0.309 | 0.309 |
| 128 | 0.344 | 0.343 | 0.368 | 0.367 |
| 256 | 0.418 | 0.418 | 0.425 | 0.424 |
| 512 | 0.471 | 0.471 | 0.493 | 0.493 |
| 1024 | 0.550 | 0.550 | 0.565 | 0.565 |
Raw data and (generalized) script
Candidates are taken equally from both generators.
| Candidates | Coverage | Coverage (distinct) |
|---|---|---|
| 2 | 0.732 | 0.536 |
| 4 | 0.762 | 0.581 |
| 8 | 0.795 | 0.626 |
| 16 | 0.814 | 0.671 |
| 32 | 0.833 | 0.718 |
| 64 | 0.852 | 0.764 |
| 128 | 0.873 | 0.801 |
| 256 | 0.888 | 0.830 |
| 512 | 0.905 | 0.861 |
| 1024 | 0.917 | 0.891 |
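How candidates are "taken equally from both generators" is not detailed on this page. One possible reading, sketched here as an assumption, is that each generator contributes the top half of the requested number of candidates and that entries proposed by both are kept only once. The function name and the toy lists are hypothetical.

```python
def combine_equally(ranked_a, ranked_b, x):
    """Take x // 2 candidates from each ranked list, dropping duplicates."""
    half = x // 2
    merged = []
    for name in ranked_a[:half] + ranked_b[:half]:
        if name not in merged:
            merged.append(name)
    return merged


# Hypothetical ranked outputs of two candidate generators.
combined = combine_equally(["a", "b", "c"], ["b", "d", "e"], 4)
```

Under this reading the combined set can be smaller than x whenever the two generators agree on a candidate.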