Made off-by-one adjustments for specials tokens by agtsai-i · Pull Request #41 · cemoody/lda2vec

agtsai-i · 2016-07-12T20:58:08Z

preprocess.tokenize() pads texts with -2 (the SKIP index), so SKIP ends up in the corpus vocabulary and in counts_loose.

_loose_keys_ordered() then prepends the special tokens (OOV and SKIP) when building keys_loose, so SKIP is allocated two array entries (instead of one, as I assume was intended).
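The duplication described above can be reproduced in isolation. The following is a minimal sketch, not the actual lda2vec code: the helper names `keys_with_duplicates` and `keys_deduplicated`, the toy counts, and the value `OOV = -1` are assumptions made for illustration; only `SKIP = -2` comes from the description.

```python
import numpy as np

SKIP = -2  # padding index, as stated in the description
OOV = -1   # assumed value for the out-of-vocabulary index


def keys_with_duplicates(counts, specials=(OOV, SKIP)):
    """Prepend specials unconditionally (hypothetical stand-in for the
    behaviour described above). A special that already occurs in
    `counts` ends up in the result twice."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return np.array(list(specials) + ordered)


def keys_deduplicated(counts, specials=(OOV, SKIP)):
    """Prepend specials, but drop them from the counted keys first."""
    ordered = [k for k in sorted(counts, key=counts.get, reverse=True)
               if k not in specials]
    return np.array(list(specials) + ordered)


# Toy counts: SKIP is present because the padded texts contain it.
counts = {SKIP: 7, 10: 5, 11: 3, 12: 1}

dup = keys_with_duplicates(counts)
dedup = keys_deduplicated(counts)
n_unique = len(set(counts) | {OOV, SKIP})

print(len(dup), len(dedup), n_unique)
```

In this toy case the unconditional prepend yields one more key than there are unique words plus specials, which is the extra entry the description refers to.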

This becomes a problem when you try to train a model using all of the words in the vocabulary. In lda2vec_run.py, the two sides of this assignment no longer line up:

model.sampler.W.data[:, :] = vectors[:n_vocab, :]

W is created with one more row than there are unique words + specials, because n_keys is derived from the length of the concatenated array built in _loose_keys_ordered(), not from the number of unique words in the vocabulary as counted by counts_loose.
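To illustrate why one extra row matters for that assignment, here is a small sketch using plain NumPy arrays in place of the model's parameter array. The sizes and the variable `n_hidden` are hypothetical, and the sketch assumes `vectors` has exactly `n_vocab` usable rows; it only shows the general shape rule, not the project's actual behaviour.

```python
import numpy as np

n_vocab = 5           # unique words + specials (hypothetical size)
n_keys = n_vocab + 1  # key array length when SKIP is counted twice
n_hidden = 4          # hypothetical embedding width

# Stand-in for W.data: sized from n_keys, so it has one row too many.
W = np.zeros((n_keys, n_hidden), dtype=np.float32)
vectors = np.ones((n_vocab, n_hidden), dtype=np.float32)

try:
    # Full-slice assignment requires the shapes to match (or broadcast).
    W[:, :] = vectors[:n_vocab, :]
    mismatch = False
except ValueError:
    mismatch = True

# With the row count taken from the unique vocabulary, the copy succeeds.
W_fixed = np.zeros((n_vocab, n_hidden), dtype=np.float32)
W_fixed[:, :] = vectors[:n_vocab, :]

print(mismatch, W_fixed.shape)
```

Sizing the array from the deduplicated key count, rather than the concatenated length, keeps the two sides of the assignment the same shape.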

Made off-by-one adjustments for specials tokens

bb49fb8

agtsai-i wants to merge 1 commit into cemoody:master from agtsai-i:albert-off-by-one
