Define the ML task #6

@zermelozf

Hi @nihonlanguageprocessing @artificialfintelligence @JDryv @stephanyvargas

We need to define the machine learning task we will be solving.
There are a few options; here are my thoughts.

Looking at an excerpt of the SEMCOR data, we see that the input is a sequence of tokens (<wf> elements) and the prediction targets are the attributes lemma, pos, and the instance id (the id attribute of <instance> elements). I believe the instance id corresponds to a specific entry in the 国語辞典 (Japanese-language dictionary).

<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<wf lemma="" pos="">ラマイム</wf>
<wf lemma="" pos="">の</wf>
<wf lemma="" pos="">子</wf>
<wf lemma="" pos="">サレトマ</wf>
<wf lemma="" pos="">という</wf>
<wf lemma="" pos="">者'</wf>
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>
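
To make the input/output structure concrete, here is a minimal sketch of reading one such sentence into two aligned lists, one of surface tokens and one of instance ids. It assumes the file is well-formed XML with the <wf> and <instance> elements shown above; the function name `parse_sentence` and the shortened `SAMPLE` string are illustrative only, not part of any existing code in this project.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above, used only for illustration.
SAMPLE = """<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>"""


def parse_sentence(xml_text):
    """Return (tokens, sense_ids): two aligned lists, one entry per token.

    sense_ids holds the instance id for annotated tokens and None otherwise.
    """
    root = ET.fromstring(xml_text)
    tokens, sense_ids = [], []
    for el in root:
        if el.tag not in ("wf", "instance"):
            continue  # skip any element type not shown in the excerpt
        tokens.append(el.text or "")
        sense_ids.append(el.get("id") if el.tag == "instance" else None)
    return tokens, sense_ids


tokens, sense_ids = parse_sentence(SAMPLE)
```

Keeping the two lists the same length means any later labeling scheme can index tokens and targets by position.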

I would like to use JMDict as our base dictionary instead of the 国語辞典, since it is available under a CC license. Hence we need to convert the instance id bn:00083184v for いる into the ent_seq identifier of the same entry in JMDict: https://lindict.api.linalgo.com/v1/ja/entries/69fa42cb-12d9-4eab-9080-5fd5667e069e/.

Also, we'll need to disambiguate between different candidate entries. For example, for いる there are 10 possible candidates: https://lindict.api.linalgo.com/v1/ja/search/?query=いる
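
One way to picture the two steps above (id conversion, then choosing among candidates) is as a per-token classification example. The sketch below is only an assumption about how this could be laid out: the mapping table `BN_TO_ENT_SEQ`, the candidate table `CANDIDATES`, and every `ENT_SEQ_PLACEHOLDER_*` value are made up for illustration and are not real JMDict identifiers or real API output.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from SEMCOR instance ids to JMDict ent_seq values.
# The value is a placeholder, NOT a real ent_seq.
BN_TO_ENT_SEQ = {"bn:00083184v": "ENT_SEQ_PLACEHOLDER_3"}

# Hypothetical candidate entries per surface form (10 placeholders for いる).
CANDIDATES = {"いる": ["ENT_SEQ_PLACEHOLDER_%d" % i for i in range(1, 11)]}


@dataclass
class Example:
    tokens: List[str]        # the full sentence, as context
    target: int              # position of the token to disambiguate
    candidates: List[str]    # candidate ent_seq values for that token
    gold: Optional[int]      # index of the correct candidate, None if unmapped


def build_examples(tokens, sense_ids):
    """Create one classification example per annotated token."""
    examples = []
    for i, (tok, sid) in enumerate(zip(tokens, sense_ids)):
        if sid is None:
            continue  # unannotated token: nothing to predict
        cands = CANDIDATES.get(tok, [])
        ent = BN_TO_ENT_SEQ.get(sid)
        gold = cands.index(ent) if ent in cands else None
        examples.append(Example(tokens, i, cands, gold))
    return examples


examples = build_examples(["が", "いる", "."], [None, "bn:00083184v", None])
```

Examples whose `gold` is None flag instance ids that could not be mapped to any candidate, which is probably worth counting before training anything.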

Some additional remarks:

  • I am unsure how the SEMCOR data was tokenized. We need to make sure its tokenization is consistent with the method we are using.
  • This could be cast as a sequence-to-sequence task. In fact, I am currently writing a parser that returns two sequences, as shown below:
[Image: parser output]
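
The tokenization concern in the first remark can be checked mechanically. Below is a minimal sketch that compares two tokenizations of the same sentence by character offsets and reports the SEMCOR tokens whose boundaries the other tokenizer does not reproduce. The function names are illustrative, and the second token list in the usage lines is a made-up example, not the output of any particular tokenizer.

```python
def spans(tokens):
    """Character (start, end) offsets of each token in the concatenated text."""
    out, pos = [], 0
    for t in tokens:
        out.append((pos, pos + len(t)))
        pos += len(t)
    return out


def mismatched_spans(semcor_tokens, our_tokens):
    """Return the SEMCOR token spans that are absent from our tokenization."""
    if "".join(semcor_tokens) != "".join(our_tokens):
        raise ValueError("token sequences cover different text")
    ours = set(spans(our_tokens))
    return [s for s in spans(semcor_tokens) if s not in ours]


semcor = ["という", "者", "が", "いる"]
ours = ["と", "いう", "者", "が", "いる"]  # hypothetical tokenizer output
diff = mismatched_spans(semcor, ours)
```

Here `diff` contains the span of という, which the hypothetical tokenizer splits in two. An annotated token that lands in such a span would need its label re-aligned or dropped.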

Please let me know your thoughts. In particular, @nihonlanguageprocessing, do you know more about the base SEMCOR task and how current state-of-the-art models solve it?
