Coverage tests

Cosine similarity

Setup

Terminology: 2012 snapshot of MEDIC (taken from DNorm); all names and synonyms are converted to lower case and mapped to all of their IDs (preferred + alternative)
Corpus: NCBI-disease training set
index the terminology names with simstring
use cosine similarity with varying thresholds to retrieve similar-looking terminology entries for each mention in the corpus (the candidate set fed into the CNN)
if at least one of the candidates links to an ID found in the reference (compound concepts: a link for each component), count this as a match
compute the proportion of matches (coverage) and the mean size (and std.dev.) of the candidate sets for every threshold value
compute separate scores with duplicates removed (distinct mentions)
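
The evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for the results: it replaces simstring with a linear scan over padded character trigrams, and the set-based cosine formula, the names `char_ngrams`, `candidates`, and `evaluate`, the handling of distinct mentions, and the use of the population standard deviation are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev

def char_ngrams(text, n=3):
    """Set of character n-grams of the lower-cased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def cosine(a, b):
    """Set-based cosine similarity: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def candidates(mention, terminology, threshold):
    """Names in the terminology whose similarity to the mention reaches the threshold."""
    grams = char_ngrams(mention)
    return [name for name in terminology
            if cosine(grams, char_ngrams(name)) >= threshold]

def evaluate(mentions, terminology, threshold, distinct=False):
    """Return (coverage, mean candidate-set size, std. dev. of the size).

    mentions: list of (text, reference_ids) pairs
    terminology: dict mapping each lower-cased name to a set of IDs
    """
    if distinct:
        # Assumption: keep one entry per distinct (lower-cased) mention text.
        mentions = list({text.lower(): (text, ref) for text, ref in mentions}.values())
    hits, sizes = 0, []
    for text, ref_ids in mentions:
        cands = candidates(text, terminology, threshold)
        cand_ids = set().union(*(terminology[c] for c in cands))
        hits += bool(cand_ids & ref_ids)  # a match if any candidate ID is in the reference
        sizes.append(len(cands))
    return hits / len(mentions), mean(sizes), pstdev(sizes)

# Toy data (hypothetical names and IDs, for illustration only).
terminology = {
    "breast cancer": {"ID:1"},
    "breast neoplasms": {"ID:1"},
    "colon cancer": {"ID:2"},
}
mentions = [("breast cancers", {"ID:1"}), ("breast cancers", {"ID:1"}), ("xyz", {"ID:9"})]

for threshold in (0.7, 0.5, 0.3):
    print(threshold, evaluate(mentions, terminology, threshold),
          evaluate(mentions, terminology, threshold, distinct=True))
```

Lowering the threshold enlarges the candidate sets, which is the trade-off between coverage and candidate-set size that the results table reports.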

Results

Threshold	Coverage	Candidates, mean (std. dev.)	Coverage (distinct)	Candidates (distinct), mean (std. dev.)
.7	0.739	9.8 (26.8)	0.546	7.25 (19.9)
.6	0.782	27.0 (63.8)	0.643	23.1 (54.1)
.5	0.834	86.1 (194)	0.733	79.5 (169)
.4	0.859	270 (554)	0.796	253 (480)
.3	0.885	696 (1230)	0.853	705 (1080)
.2	0.904	1290 (1800)	0.887	1510 (1670)
.1	0.922	2940 (2740)	0.921	3970 (2570)

Number of mentions: 5145
Number of distinct mentions: 1596

Raw data and script

Absolute skip-gram overlap

Setup

Same as for cosine similarity, but with a different candidate retrieval mechanism in place of simstring:

dictionary entries are indexed by their character skip-grams
candidates are generated by picking the top-x entries with the highest skip-gram overlap with a given mention
ties (multiple candidates with the same overlap) are resolved by preferring shorter candidates
seven different values for x (number of candidates per mention) are evaluated
two different skip-gram configurations are evaluated:
- 1-skip-bigrams and 1-skip-trigrams (i.e. regular bi-/trigrams plus bi-/trigrams with a one-character gap)
- 1-skip-bigrams and 2-skip-trigrams (the same, but two gaps are allowed for trigrams)
separate figures are computed with duplicates removed (distinct mentions)
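
The retrieval mechanism described above might look like the following sketch. Several details are assumptions, not taken from the original script: a k-skip-n-gram is read here as n characters in order with at most k skipped characters in total, overlap is counted on multisets, "shorter" is measured in characters, and the dictionary is scanned linearly instead of being indexed. The names `skipgrams`, `features`, and `top_candidates` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def skipgrams(text, n, k):
    """Multiset of character k-skip-n-grams (at most k characters skipped in total)."""
    grams = Counter()
    for start in range(len(text)):
        # The remaining n-1 characters come from the next n+k-1 positions.
        window = range(start + 1, min(start + n + k, len(text)))
        for rest in combinations(window, n - 1):
            grams["".join(text[i] for i in (start, *rest))] += 1
    return grams

def features(text, config=((2, 1), (3, 1))):
    """All skip-grams of a text for a configuration of (n, k) pairs."""
    total = Counter()
    for n, k in config:
        total += skipgrams(text.lower(), n, k)
    return total

def top_candidates(mention, names, x, config=((2, 1), (3, 1))):
    """The x names with the highest absolute skip-gram overlap; ties go to shorter names."""
    mention_grams = features(mention, config)

    def overlap(name):
        # Multiset intersection: number of skip-grams shared with the mention.
        return sum((mention_grams & features(name, config)).values())

    return sorted(names, key=lambda name: (-overlap(name), len(name)))[:x]

names = ["breast cancers", "colon cancer", "breast cancer"]
print(top_candidates("breast cancer", names, 2))
print(top_candidates("breast cancer", names, 2, config=((2, 1), (3, 2))))
```

In this toy example, "breast cancer" and "breast cancers" share exactly the same number of skip-grams with the mention, so the tie-breaking rule decides their order.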

Results

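All fractional numbers denote coverage.
In the column headers, "n,k-grams" stands for k-skip-n-grams (e.g. "3,2-grams" are 2-skip-trigrams).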

Candidates	2,1- and 3,1-grams	2,1- and 3,2-grams	2,1- and 3,1-grams (distinct)	2,1- and 3,2-grams (distinct)
2	0.726	0.731	0.515	0.518
4	0.751	0.750	0.550	0.550
8	0.768	0.764	0.587	0.585
16	0.790	0.792	0.626	0.628
32	0.807	0.807	0.665	0.666
64	0.825	0.824	0.702	0.702
128	0.842	0.842	0.737	0.739

Raw data and script

Phrase-embedding similarity

Setup

Same data as above.

pretrained word embeddings provided by Haodi Li et al. (2017)
phrase vectors are obtained by summing or averaging over the word vectors of all words in a mention/dictionary entry
candidates are generated with gensim's KeyedVectors.most_similar method
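
As a rough illustration of the phrase-vector construction, the sketch below uses plain Python lists and a hand-written cosine ranking as a stand-in for gensim's KeyedVectors.most_similar. The toy vectors, whitespace tokenization, the skipping of out-of-vocabulary words, and the names `phrase_vector` and `most_similar` are assumptions for the example only.

```python
import math

def phrase_vector(phrase, word_vectors, strategy="mean"):
    """Combine the vectors of all known words in a phrase by summing or averaging."""
    vectors = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vectors:
        return None  # no known word: no phrase vector
    summed = [sum(dim) for dim in zip(*vectors)]
    if strategy == "sum":
        return summed
    return [value / len(vectors) for value in summed]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(mention, names, word_vectors, topn, strategy="mean"):
    """The topn dictionary names whose phrase vectors are closest to the mention's."""
    query = phrase_vector(mention, word_vectors, strategy)
    if query is None:
        return []
    scored = []
    for name in names:
        vector = phrase_vector(name, word_vectors, strategy)
        if vector is not None:
            scored.append((cosine(query, vector), name))
    scored.sort(key=lambda pair: -pair[0])
    return [name for _, name in scored[:topn]]

# Toy 2-dimensional embeddings (hypothetical values).
word_vectors = {
    "breast": [1.0, 0.0],
    "cancer": [0.0, 1.0],
    "carcinoma": [0.1, 0.9],
    "colon": [-1.0, 0.2],
}
names = ["colon cancer", "breast carcinoma", "colon carcinoma"]
print(most_similar("breast cancer", names, word_vectors, topn=2))
```

With a pure cosine ranking as in this sketch, the summed and averaged vectors of a phrase point in the same direction and therefore rank candidates identically; both strategies are kept only to mirror the setup.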

Results

All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining word vectors into phrase vectors.

Candidates	sum	mean	sum (distinct)	mean (distinct)
2	0.375	0.396	0.345	0.346
4	0.407	0.436	0.392	0.398
8	0.485	0.512	0.447	0.450
16	0.537	0.546	0.499	0.502
32	0.589	0.591	0.555	0.557
64	0.637	0.639	0.614	0.616
128	0.669	0.667	0.659	0.660
256	0.713	0.715	0.708	0.713
512	0.772	0.774	0.764	0.766
1024	0.817	0.820	0.820	0.826

Raw data and script

Phrase-embedding similarity based on subword-units

Setup

Same as the previous setup, but the embeddings are trained on subword units of PubMed abstracts instead of whole words.

Results

All fractional numbers denote coverage.
"sum" and "mean" are two different strategies for combining subword-unit vectors into phrase vectors.

Candidates	sum	mean	sum (distinct)	mean (distinct)
2	0.122	0.122	0.112	0.112
4	0.144	0.144	0.140	0.141
8	0.192	0.192	0.179	0.179
16	0.221	0.221	0.218	0.218
32	0.258	0.259	0.267	0.268
64	0.292	0.292	0.309	0.309
128	0.344	0.343	0.368	0.367
256	0.418	0.418	0.425	0.424
512	0.471	0.471	0.493	0.493
1024	0.550	0.550	0.565	0.565

Raw data and (generalized) script

Combined skip-gram and phrase-embedding similarity

Results

Candidates are taken equally from both generators.

Candidates	Coverage	Coverage (distinct)
2	0.732	0.536
4	0.762	0.581
8	0.795	0.626
16	0.814	0.671
32	0.833	0.718
64	0.852	0.764
128	0.873	0.801
256	0.888	0.830
512	0.905	0.861
1024	0.917	0.891
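
One way to take candidates equally from two generators is to interleave their ranked lists, as in the sketch below. How the original script handles candidates proposed by both generators is not described here; skipping duplicates and continuing down the lists is an assumption, and `combine` is a hypothetical helper.

```python
from itertools import chain, zip_longest

def combine(ranked_a, ranked_b, x):
    """Interleave two ranked candidate lists, skipping duplicates, until x are collected."""
    combined = []
    seen = set()
    for cand in chain.from_iterable(zip_longest(ranked_a, ranked_b)):
        if cand is None or cand in seen:
            continue  # padding from the shorter list, or already taken
        combined.append(cand)
        seen.add(cand)
        if len(combined) == x:
            break
    return combined

skipgram_ranked = ["a", "b", "c"]    # e.g. output of the skip-gram generator
embedding_ranked = ["b", "d", "e"]   # e.g. output of the phrase-embedding generator
print(combine(skipgram_ranked, embedding_ranked, 4))
```

When the two lists do not overlap, each generator contributes exactly half of the x candidates.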

Raw data and script
