Coverage tests

Lenz edited this page Jun 7, 2018 · 7 revisions

Cosine similarity

Setup

  • Terminology: 2012 snapshot of MEDIC (taken from DNorm); all names and synonyms are converted to lower case and mapped to all of their IDs (preferred + alternative)
  • Corpus: NCBI-disease training set
  • index the terminology names with simstring
  • use cosine similarity with varying thresholds to retrieve similar-looking terminology entries for each mention in the corpus (the candidate set fed into the CNN)
  • if at least one of the candidates links to an ID found in the reference (compound concepts: a link for each component), count this as a match
  • compute the proportion of matches (coverage) and the mean size (and std.dev.) of the candidate sets for every threshold value
  • compute separate scores with duplicates removed (distinct mentions)
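
The retrieval step can be sketched without simstring as a brute-force scan. This is only an illustration, assuming character trigrams padded with spaces and the set-based cosine |X ∩ Y| / sqrt(|X| · |Y|); the function names and the toy terminology are hypothetical, and the actual experiments used a simstring index.

```python
import math

def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded at both ends.
    Assumption: trigrams with space padding."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def cosine(a, b):
    """Set-based cosine similarity: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def retrieve(mention, names, threshold):
    """Return all terminology names whose similarity to the mention reaches the threshold."""
    query = char_ngrams(mention.lower())
    return [name for name in names if cosine(query, char_ngrams(name)) >= threshold]

# Toy terminology (hypothetical entries, already lower-cased).
names = ["breast cancer", "breast carcinoma", "colon cancer", "asthma"]
loose = retrieve("Breast cancers", names, 0.3)
strict = retrieve("Breast cancers", names, 0.7)
```

Lowering the threshold can only add candidates, which is why coverage and candidate-set size grow together in the results.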

Results

| Threshold | Coverage | Candidates, mean (std. dev.) | Coverage (distinct) | Candidates (distinct), mean (std. dev.) |
|---|---|---|---|---|
| 0.7 | 0.739 | 9.8 (26.8) | 0.546 | 7.25 (19.9) |
| 0.6 | 0.782 | 27.0 (63.8) | 0.643 | 23.1 (54.1) |
| 0.5 | 0.834 | 86.1 (194) | 0.733 | 79.5 (169) |
| 0.4 | 0.859 | 270 (554) | 0.796 | 253 (480) |
| 0.3 | 0.885 | 696 (1230) | 0.853 | 705 (1080) |
| 0.2 | 0.904 | 1290 (1800) | 0.887 | 1510 (1670) |
| 0.1 | 0.922 | 2940 (2740) | 0.921 | 3970 (2570) |

Number of mentions: 5145
Number of distinct mentions: 1596
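
A minimal sketch of how the scores could be computed over all mentions and over distinct mentions. Several points are assumptions rather than details taken from the original script: a compound mention counts as matched only if every component is linked, duplicates are detected by mention text, and the population standard deviation is used. All names and the placeholder IDs are hypothetical.

```python
import statistics

def evaluate(mentions, candidates_for, name_to_ids):
    """Return (coverage, mean candidate-set size, std. dev. of the size).

    mentions: list of (mention text, reference) pairs, where reference is a
    list of ID sets, one per component of a (possibly compound) concept.
    candidates_for: function mapping a mention text to a list of names.
    name_to_ids: dict mapping each terminology name to its set of IDs.
    """
    hits, sizes = 0, []
    for text, reference in mentions:
        candidates = candidates_for(text)
        linked = set()
        for name in candidates:
            linked |= name_to_ids.get(name, set())
        # Assumption: every component of a compound concept must be linked.
        hits += all(component & linked for component in reference)
        sizes.append(len(candidates))
    return hits / len(mentions), statistics.mean(sizes), statistics.pstdev(sizes)

def distinct(mentions):
    """Drop repeated mention texts, keeping the first occurrence."""
    seen, unique = set(), []
    for text, reference in mentions:
        if text not in seen:
            seen.add(text)
            unique.append((text, reference))
    return unique

# Toy data with placeholder IDs; the lookup table stands in for a real candidate generator.
name_to_ids = {"breast cancer": {"X1"}, "ovarian cancer": {"X2"}}
lookup = {"breast cancer": ["breast cancer"],
          "breast and ovarian cancer": ["breast cancer"],
          "wheezing": []}
mentions = [
    ("breast cancer", [{"X1"}]),
    ("breast cancer", [{"X1"}]),
    ("breast and ovarian cancer", [{"X1"}, {"X2"}]),  # compound concept
    ("wheezing", [{"X3"}]),
]
result_all = evaluate(mentions, lookup.get, name_to_ids)
result_distinct = evaluate(distinct(mentions), lookup.get, name_to_ids)
```

In the toy data the repeated mention is an easy one, so removing duplicates lowers coverage (2 of 4 versus 1 of 3).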

Raw data and script

Absolute skip-gram overlap

Setup

Same as for cosine similarity, but with a different candidate retrieval mechanism in place of simstring:

  • dictionary entries are indexed by their character skip-grams
  • candidates are generated by picking the top-x entries with the highest skip-gram overlap with a given mention
  • ties (multiple candidates with the same overlap) are resolved by preferring shorter candidates
  • seven different values for x (number of candidates per mention) are evaluated
  • two different skip-gram configurations are evaluated:
    • 1-skip-bigrams and 1-skip-trigrams (i.e. regular bi-/trigrams plus bi-/trigrams with a one-character gap)
    • 1-skip-bigrams and 2-skip-trigrams (the same, but two gaps are allowed for trigrams)
  • separate figures are computed with duplicates removed (distinct mentions)
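
The candidate generation described above can be sketched as follows. This is an illustration under stated assumptions, not the original script: a k-skip-n-gram is taken to skip at most k characters in total, overlap is counted on sets (not multisets), and the brute-force ranking replaces the skip-gram index. Function names and the toy entries are hypothetical.

```python
from itertools import combinations

def skipgrams(text, n, k):
    """Character k-skip-n-grams: n characters in order, skipping at most k in total."""
    grams = set()
    for start in range(len(text)):
        # Positions that may follow the first character within the allowed span.
        window = range(start + 1, min(start + n + k, len(text)))
        for rest in combinations(window, n - 1):
            grams.add(text[start] + "".join(text[i] for i in rest))
    return grams

def features(text):
    """1-skip-bigrams plus 1-skip-trigrams (the first configuration)."""
    return skipgrams(text, 2, 1) | skipgrams(text, 3, 1)

def top_candidates(mention, names, x):
    """Pick the x names with the highest absolute overlap; shorter names win ties."""
    query = features(mention.lower())
    ranked = sorted(names, key=lambda name: (-len(query & features(name)), len(name)))
    return ranked[:x]

# Toy dictionary (hypothetical entries, already lower-cased).
names = ["familial cancer", "cancer", "asthma"]
best = top_candidates("Cancer", names, 1)
top = top_candidates("Cancer", names, 2)
```

Both "cancer" and "familial cancer" contain every skip-gram of the query, so their absolute overlap is tied and the shorter entry is ranked first.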

Results

All fractional numbers denote coverage.

| Candidates | 2,1- and 3,1-grams | 2,1- and 3,2-grams | 2,1- and 3,1-grams (distinct) | 2,1- and 3,2-grams (distinct) |
|---|---|---|---|---|
| 2 | 0.726 | 0.731 | 0.515 | 0.518 |
| 4 | 0.751 | 0.750 | 0.550 | 0.550 |
| 8 | 0.768 | 0.764 | 0.587 | 0.585 |
| 16 | 0.790 | 0.792 | 0.626 | 0.628 |
| 32 | 0.807 | 0.807 | 0.665 | 0.666 |
| 64 | 0.825 | 0.824 | 0.702 | 0.702 |
| 128 | 0.842 | 0.842 | 0.737 | 0.739 |

Raw data and script

Phrase-embedding similarity

Setup

Same data as above.

  • pretrained word embeddings provided by Haodi Li et al. 2017.
  • phrase vectors are obtained by summing or averaging over the word vectors of all words in a mention/dictionary entry
  • candidates are generated with gensim's KeyedVectors.most_similar method
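
A pure-Python sketch of the phrase-vector construction and ranking. The cosine ranking here is a stand-in for gensim's KeyedVectors.most_similar, the handling of unknown words (they are skipped) is an assumption, and the two-dimensional vectors are made up for illustration.

```python
import math

def phrase_vector(phrase, word_vectors, mode="sum"):
    """Combine the vectors of all known words in a phrase by summing or averaging."""
    vectors = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vectors:
        return None  # assumption: phrases without any known word yield no vector
    total = [sum(dims) for dims in zip(*vectors)]
    if mode == "mean":
        return [value / len(vectors) for value in total]
    return total

def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def most_similar(mention, entries, word_vectors, topn, mode="sum"):
    """Rank dictionary entries by cosine similarity to the mention's phrase vector."""
    query = phrase_vector(mention, word_vectors, mode)
    if query is None:
        return []
    scored = []
    for entry in entries:
        vector = phrase_vector(entry, word_vectors, mode)
        if vector is not None:
            scored.append((cosine(query, vector), entry))
    scored.sort(key=lambda pair: -pair[0])
    return [entry for _, entry in scored[:topn]]

# Toy 2-dimensional vectors (made up, not real embeddings).
word_vectors = {"breast": [1.0, 0.0], "cancer": [0.0, 1.0],
                "carcinoma": [0.1, 0.9], "asthma": [-1.0, 0.2]}
entries = ["breast carcinoma", "cancer", "asthma"]
ranked = most_similar("breast cancer", entries, word_vectors, topn=2)
```

Under cosine similarity alone, the sum and the mean differ only by a positive scale factor, so this sketch returns the same ranking for both modes.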

Results

All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining word vectors into phrase vectors.

| Candidates | sum | mean | sum (distinct) | mean (distinct) |
|---|---|---|---|---|
| 2 | 0.375 | 0.396 | 0.345 | 0.346 |
| 4 | 0.407 | 0.436 | 0.392 | 0.398 |
| 8 | 0.485 | 0.512 | 0.447 | 0.450 |
| 16 | 0.537 | 0.546 | 0.499 | 0.502 |
| 32 | 0.589 | 0.591 | 0.555 | 0.557 |
| 64 | 0.637 | 0.639 | 0.614 | 0.616 |
| 128 | 0.669 | 0.667 | 0.659 | 0.660 |
| 256 | 0.713 | 0.715 | 0.708 | 0.713 |
| 512 | 0.772 | 0.774 | 0.764 | 0.766 |
| 1024 | 0.817 | 0.820 | 0.820 | 0.826 |

Raw data and script

Phrase-embedding similarity based on subword-units

Setup

Same as in the previous section, but the embeddings are trained on subword units of PubMed abstracts instead of actual words.

Results

All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining subword-unit vectors into phrase vectors.

| Candidates | sum | mean | sum (distinct) | mean (distinct) |
|---|---|---|---|---|
| 2 | 0.122 | 0.122 | 0.112 | 0.112 |
| 4 | 0.144 | 0.144 | 0.140 | 0.141 |
| 8 | 0.192 | 0.192 | 0.179 | 0.179 |
| 16 | 0.221 | 0.221 | 0.218 | 0.218 |
| 32 | 0.258 | 0.259 | 0.267 | 0.268 |
| 64 | 0.292 | 0.292 | 0.309 | 0.309 |
| 128 | 0.344 | 0.343 | 0.368 | 0.367 |
| 256 | 0.418 | 0.418 | 0.425 | 0.424 |
| 512 | 0.471 | 0.471 | 0.493 | 0.493 |
| 1024 | 0.550 | 0.550 | 0.565 | 0.565 |

Raw data and (generalized) script

Combined skip-gram and phrase-embedding similarity

Results

Candidates are taken equally from both generators.

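| Candidates | Coverage | Coverage (distinct) |
|---|---|---|
| 2 | 0.732 | 0.536 |
| 4 | 0.762 | 0.581 |
| 8 | 0.795 | 0.626 |
| 16 | 0.814 | 0.671 |
| 32 | 0.833 | 0.718 |
| 64 | 0.852 | 0.764 |
| 128 | 0.873 | 0.801 |
| 256 | 0.888 | 0.830 |
| 512 | 0.905 | 0.861 |
| 1024 | 0.917 | 0.891 |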
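
One way to read "taken equally from both generators" is sketched below: half of the candidate budget comes from each ranked list. The function name is hypothetical, and the treatment of overlaps (duplicates are dropped without backfilling) is an assumption, since the page does not specify it.

```python
def combine(skipgram_ranked, embedding_ranked, x):
    """Take x/2 candidates from each ranked list and merge them without duplicates."""
    half = x // 2
    merged = []
    for name in skipgram_ranked[:half] + embedding_ranked[:half]:
        # Assumption: a name proposed by both generators is kept once, not replaced.
        if name not in merged:
            merged.append(name)
    return merged

# Placeholder candidate names standing in for the output of the two generators.
combined = combine(["a", "b", "c"], ["b", "d", "e"], 4)
```

With this reading, the combined set can be smaller than x whenever the two generators agree on a candidate.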

Raw data and script
