Define the ML task #6
Description
Hi @nihonlanguageprocessing @artificialfintelligence @JDryv @stephanyvargas
We need to define the machine learning task we will be solving.
There are a few options; here are my thoughts.
Looking at an excerpt of the SEMCOR data, we see that the input is a sequence of tokens (<wf> elements) and the prediction is a set of attributes: lemma, pos, and instance_id. I believe instance_id corresponds to a specific entry in the 国語辞典 (Japanese-language dictionary).
<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<wf lemma="" pos="">ラマイム</wf>
<wf lemma="" pos="">の</wf>
<wf lemma="" pos="">子</wf>
<wf lemma="" pos="">サレトマ</wf>
<wf lemma="" pos="">という</wf>
<wf lemma="" pos="">者'</wf>
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>
I would like to use JMDict as our base dictionary instead of the 国語辞典, since it is available under a CC license. Hence we need to convert the instance id bn:00083184v for いる into the ent_seq identifier of the corresponding entry in JMDict: https://lindict.api.linalgo.com/v1/ja/entries/69fa42cb-12d9-4eab-9080-5fd5667e069e/.
Also, we'll need to disambiguate between different candidate entries. For example, for いる there are 10 possible candidates: https://lindict.api.linalgo.com/v1/ja/search/?query=いる
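To fetch candidates for other tokens, the query has to be percent-encoded before it goes into the search URL. Below is a minimal sketch that only builds the URL from the endpoint shown above; the helper name `candidate_search_url` is hypothetical, and the sketch makes no request and assumes nothing about the response format.

```python
from urllib.parse import urlencode

# Endpoint taken from the search link above.
SEARCH_ENDPOINT = "https://lindict.api.linalgo.com/v1/ja/search/"


def candidate_search_url(surface_form: str) -> str:
    """Build the candidate-search URL for a token (hypothetical helper).

    urlencode percent-encodes the UTF-8 bytes of the query, so Japanese
    text is safe to pass directly.
    """
    return SEARCH_ENDPOINT + "?" + urlencode({"query": surface_form})


url = candidate_search_url("いる")
print(url)
```

The disambiguation step would then pick one entry among the candidates returned for each token.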
Some additional remarks:
- I am unsure how the SEMCOR data was tokenized. We need to make sure its tokenization is consistent with the tokenizer we use.
- This could be cast as a seq to seq task. In fact, right now I am writing a parser that returns two sequences as shown below:
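As a rough illustration of the sequence framing (this is not the parser mentioned above, only a minimal sketch), one could read a SEMCOR-style sentence into two aligned sequences: the tokens and, for each token, either its instance id or a placeholder. The function name `sentence_to_sequences`, the `"O"` placeholder, and the shortened sample sentence are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the excerpt above, used only as sample input.
SAMPLE = """<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>"""


def sentence_to_sequences(xml_text, no_sense="O"):
    """Return (tokens, labels) for one <sentence> element.

    <instance> elements contribute their id as the label; plain <wf>
    tokens get the placeholder label (an assumption of this sketch).
    """
    root = ET.fromstring(xml_text)
    tokens, labels = [], []
    for child in root:
        tokens.append(child.text or "")
        if child.tag == "instance":
            labels.append(child.get("id"))
        else:
            labels.append(no_sense)
    return tokens, labels


tokens, labels = sentence_to_sequences(SAMPLE)
print(tokens)
print(labels)
```

Once the ids are mapped to JMDict, the label sequence would hold ent_seq identifiers instead of the bn: ids.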
Please let me know your thoughts. In particular, @nihonlanguageprocessing, do you know more about the base SEMCOR task and how current state-of-the-art models solve it?