Task Parsing Part for Pytorch Implementation #552
ZhengTang1120 wants to merge 134 commits into master
Looks nice so far!
also refined some functions in embedding layer
Now, the model initialization part is working
fixed bugs on UNK word embedding; set dropout prob to 0.1; add clipping
w2i = {}
i = 0
I think that this might help. In the previous version with the head start of i = 1, it seems like the wrong vectors might have been used. If one looked up "," in w2i, it might have been mapped to 2 instead of 1.
This is because the previous version treated the empty string "" and the unknown token "<UNK>" differently: id 0 was taken by <UNK>, so i started from 1.
In the current version, "" and "<UNK>" share the same embedding, so we do not need an extra id for ""/"<UNK>".
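The indexing change discussed in this thread can be illustrated with a small sketch. It is not the PR's code: the function name `build_w2i` and the sample vocabulary are hypothetical, and it assumes the empty string is folded into `<UNK>` before ids are assigned, so counting can start at 0 with no id reserved.

```python
def build_w2i(words):
    """Assign consecutive ids starting at 0; "" is folded into "<UNK>"."""
    w2i = {}
    i = 0
    for word in words:
        # Assumption: "" and "<UNK>" share a single entry.
        key = "<UNK>" if word == "" else word
        if key not in w2i:
            w2i[key] = i
            i += 1
    return w2i


# Hypothetical vocabulary in file order: empty string, comma, "the".
vocab = ["", ",", "the"]
w2i = build_w2i(vocab)
```

With this scheme the id of each word equals its row position in the weight matrix, which is the property the off-by-one concern above is about.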
else:
    delimiter = " "
word, *rest = line.rstrip().split(delimiter)
word = "<UNK>" if word == "" else word
If Python is OK using an empty string as a key, this should not be necessary.
It is easier to change the key here than to change every token throughout the code...
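For reference, a minimal sketch of the parsing step under discussion. The helper name `parse_embedding_line` is hypothetical; the split and the `<UNK>` substitution mirror the two lines quoted from the diff. Python dictionaries do accept "" as a key, so the substitution is a convenience for the rest of the code rather than a requirement.

```python
def parse_embedding_line(line, delimiter=" "):
    """Split one embedding-file line into (word, vector)."""
    word, *rest = line.rstrip().split(delimiter)
    # A line starting with the delimiter yields an empty first field;
    # map it to "<UNK>" so downstream code only deals with one key.
    word = "<UNK>" if word == "" else word
    vector = [float(x) for x in rest]
    return word, vector


# Python itself has no problem with an empty-string key.
empty_key_ok = {"": 1}[""] == 1

unk_word, unk_vector = parse_embedding_line(" 0.1 0.2\n")
the_word, the_vector = parse_embedding_line("the 1.0 2.0\n")
```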
    emb_dict["<UNK>"] = vector
else:
    emb_dict[word] = vector
emb_dict[word] = vector
Are two copies of the arrays being kept temporarily: one in emb_dict and another in weights? If memory is an issue, it seems like one could record this vector right away in weights.
You are right, I will refine this later. Thanks!
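One way the suggested refinement might look, as a sketch only: each vector is appended to `weights` as soon as it is read, and `w2i` records its row, so no intermediate `emb_dict` is kept. The function name `load_embeddings`, the duplicate-key policy, and the sample data are assumptions for illustration, not the PR's implementation.

```python
import io


def load_embeddings(lines, delimiter=" "):
    """Build w2i and weights in a single pass, without an emb_dict."""
    w2i = {}
    weights = []
    for line in lines:
        word, *rest = line.rstrip().split(delimiter)
        word = "<UNK>" if word == "" else word
        if word in w2i:
            # Assumption: keep the first vector seen for a repeated key.
            continue
        w2i[word] = len(weights)  # row index of this word's vector
        weights.append([float(x) for x in rest])
    return w2i, weights


# Hypothetical three-line embedding file; the first word is empty.
sample = io.StringIO(" 0.0 0.0\n, 0.1 0.2\nthe 0.3 0.4\n")
w2i, weights = load_embeddings(sample)
```

The resulting list could then be converted once, for example with `torch.tensor(weights)`. That conversion still makes a copy, so if the number of rows is known in advance, filling a preallocated tensor row by row would avoid even that.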
@MihaiSurdeanu @bethard @kwalcock
Here is my current code for the MTL in PyTorch. Thanks to Steve, I have already fixed a few bugs in the code.
This covers only the task manager and file reader; I will open another pull request once the NER task is implemented.