ConceptNetNumberbatch word embeddings support#14
zgornel wants to merge 5 commits into JuliaText:master
Thanks for this. I think we might want to add more smarts to the return type. I do not like the use of It is also clear to me that as we add more embedding types,
Thanks for the feedback. A few remarks:
I am not sure what you mean; the tests delete their downloads automatically.
Yes, ideally, we would just test on mini-datasets.
On many Unix-like systems
@zgornel Thanks! I'll start taking a look at this too.
Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.
I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.
That is my understanding too.
src/conceptnet.jl
    cnt = 0
    indices = Int[]
    for (index, row) in enumerate(data)
        word, embedding = _parseline(row)
I think you can probably get a small efficiency gain here if you wait to actually parse the rest of the line as floats until you know that you are looking at a "keep word" (inside the if).
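A minimal sketch of the suggestion above, not the PR's actual code: the line is split once into the word and the still-unparsed remainder, and the floats are parsed only inside the `if`. The function name `load_selected` and the `keep_words` set are hypothetical, introduced here for illustration.

```julia
# Hypothetical helper: collect embeddings only for the words in `keep_words`.
function load_selected(lines, keep_words::Set{String})
    words = String[]
    vectors = Vector{Float64}[]
    for line in lines
        # Split off the word; the remainder stays an unparsed string.
        parts = split(line, ' '; limit=2)
        length(parts) == 2 || continue
        word = String(parts[1])
        if word in keep_words
            # The float parsing cost is paid only for words we keep.
            push!(words, word)
            push!(vectors, [parse(Float64, x) for x in split(parts[2])])
        end
    end
    return words, vectors
end

lines = ["cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6", "fish 0.7 0.8 0.9"]
words, vectors = load_selected(lines, Set(["cat", "fish"]))
```

The saving grows with the fraction of lines that are skipped, since each skipped line avoids one `parse` call per vector component.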
src/conceptnet.jl
    open(file, "r") do fid
        vocab_size, vector_size = map(x->parse(Int,x), split(readline(fid)))
        max_stored_vocab_size = _get_vocab_size(vocab_size, max_vocab_size)
        data = readlines(fid)
readlines loads the whole file into memory. I think it would be better to remove this line and iterate through the file with enumerate(eachline(fid)) instead.
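A rough sketch of the streaming variant, assuming a header line of the form `<vocab_size> <vector_size>` as in the snippet above. The name `read_embeddings` is hypothetical, and the `min` call merely stands in for the PR's `_get_vocab_size`, whose exact behavior is not shown here.

```julia
# Hypothetical reader: streams lines with `eachline` instead of `readlines`.
function read_embeddings(io::IO; max_vocab_size::Union{Nothing,Int}=nothing)
    vocab_size, vector_size = parse.(Int, split(readline(io)))
    # Assumption: cap the stored vocabulary at max_vocab_size when it is given.
    n = max_vocab_size === nothing ? vocab_size : min(vocab_size, max_vocab_size)
    words = Vector{String}(undef, n)
    embeddings = Matrix{Float64}(undef, vector_size, n)
    # Only one line is held in memory at a time.
    for (index, line) in enumerate(eachline(io))
        index > n && break
        parts = split(line)
        words[index] = String(parts[1])
        embeddings[:, index] = parse.(Float64, parts[2:end])
    end
    return words, embeddings
end

io = IOBuffer("3 2\ncat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
words, embeddings = read_embeddings(io; max_vocab_size=2)
```

The sketch trusts the header: if the file holds fewer lines than the header claims, the trailing entries stay uninitialized, so a real implementation would need to check the final count.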
That's it: I was referring to the pretrained model, which can be downloaded here. Since the multilingual ConceptNet file uses a word of the form
@oxinabox I was not aware that Languages.jl has language identification; that's great.
8b57559 to 9d86441
This pull request adds support for ConceptNetNumberbatch. Three distinct file formats are available and supported:
- .txt file, word and embeddings on each line
- .txt file, word and embeddings on each line
- Vector{Int8}

ConceptNet word keys for the multilingual datasets are of the form /c/<language>/word, which makes direct access a bit unwieldy: searching for, say, word fails. Misspellings such as word. or wordd fail as well. A more heuristic method of retrieving the best match would be advisable at this point :)
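One possible shape for such a heuristic, sketched under assumptions rather than taken from the PR: try the full /c/<language>/word key first, then fall back to the closest key of the same language by edit distance. The names `levenshtein` and `best_match`, the `language` and `max_distance` keywords, and the `Dict` storage are all hypothetical.

```julia
# Plain Levenshtein edit distance using two rolling rows.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Hypothetical lookup: exact key first, then the nearest key within max_distance edits.
function best_match(embeddings::Dict{String,Vector{Float64}}, query::AbstractString;
                    language::AbstractString="en", max_distance::Int=1)
    prefix = "/c/$language/"
    key = prefix * query
    haskey(embeddings, key) && return key
    best, best_dist = nothing, max_distance + 1
    for candidate in keys(embeddings)
        startswith(candidate, prefix) || continue
        # Compare the query against the bare word, with the prefix removed.
        d = levenshtein(query, chop(candidate; head=length(prefix), tail=0))
        if d < best_dist
            best, best_dist = candidate, d
        end
    end
    return best  # `nothing` when no key is close enough
end

emb = Dict("/c/en/word" => [0.1, 0.2],
           "/c/en/cat"  => [0.3, 0.4],
           "/c/fr/mot"  => [0.5, 0.6])
match_exact = best_match(emb, "word")
match_typo = best_match(emb, "wordd")
match_punct = best_match(emb, "word.")
match_none = best_match(emb, "zebra")
```

The fallback scans every key, so it is linear in the vocabulary size; for the full multilingual file, an index or a cheaper pre-filter would probably be needed. Ties between equally close keys are resolved arbitrarily here.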