Define the ML task

Hi @nihonlanguageprocessing @artificialfintelligence @JDryv @stephanyvargas 

We need to define the machine learning task we will be solving. 
There are a few options; here are my thoughts.

Looking at an excerpt of the SEMCOR data, we see that the input is a sequence of tokens (`<wf>`), and the prediction is a set of attributes: `lemma`, `pos`, and `instance_id` (the `id` attribute on `<instance>` elements). I believe `instance_id` corresponds to a specific entry in the 国語辞典 dictionary.

```xml
<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<wf lemma="" pos="">ラマイム</wf>
<wf lemma="" pos="">の</wf>
<wf lemma="" pos="">子</wf>
<wf lemma="" pos="">サレトマ</wf>
<wf lemma="" pos="">という</wf>
<wf lemma="" pos="">者'</wf>
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>
```

I would like to use JMDict as our base dictionary instead of the 国語辞典, since it is available under a CC license. Hence we need to convert the instance id `bn:00083184v` for *いる* into the `ent_seq` identifier of the same entry in JMDict: https://lindict.api.linalgo.com/v1/ja/entries/69fa42cb-12d9-4eab-9080-5fd5667e069e/.

<img width="470" alt="Image" src="https://github.com/user-attachments/assets/32524eda-cf4f-42cc-aa98-c4f60e4ae30d" />  

Also, we'll need to disambiguate between the different candidate entries for a given word. For example, for いる there are 10 possible candidates: https://lindict.api.linalgo.com/v1/ja/search/?query=いる

<img width="518" alt="Image" src="https://github.com/user-attachments/assets/5f71e0b1-0ed4-4e9a-ac86-540eb0853a65" />
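
To make the two steps above concrete (id conversion, then choosing among candidates), here is a rough sketch of how one training example could be represented if we frame this as picking an index in a candidate list. Everything in it is an assumption for discussion: the `Candidate` and `Example` classes, the `build_example` helper, the `id_map` lookup table, and the `ENT_SEQ_*` values and glosses, which are placeholders and not real JMDict data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    ent_seq: str  # JMDict entry identifier (placeholder values below)
    gloss: str    # short description a model could use as context


@dataclass
class Example:
    tokens: List[str]             # the tokenized sentence
    target_index: int             # position of the word to disambiguate
    candidates: List[Candidate]   # entries returned for that word
    gold: int                     # index of the correct entry in candidates


def build_example(
    tokens: List[str],
    target_index: int,
    instance_id: str,
    candidates: List[Candidate],
    id_map: Dict[str, str],
) -> Optional[Example]:
    """Build one example, or return None if the gold entry cannot be resolved."""
    gold_ent_seq = id_map.get(instance_id)
    if gold_ent_seq is None:
        return None  # no known conversion for this instance id
    for i, cand in enumerate(candidates):
        if cand.ent_seq == gold_ent_seq:
            return Example(tokens, target_index, candidates, i)
    return None  # converted id is not among the candidates


# Placeholder data only: the real mapping and candidates still need to be built.
id_map = {"bn:00083184v": "ENT_SEQ_A"}
candidates = [
    Candidate("ENT_SEQ_B", "placeholder gloss B"),
    Candidate("ENT_SEQ_A", "placeholder gloss A"),
]
example = build_example(["が", "いる", "."], 1, "bn:00083184v", candidates, id_map)
```

Returning `None` for unresolved ids is just one option; it would let us count how many SEMCOR instances we lose in the conversion before deciding how to handle them.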

Some additional remarks:

- I am unsure how the SEMCOR data was tokenized. We need to make sure its tokenization is consistent with the tokenizer we end up using.
- This could be cast as a sequence-to-sequence task. In fact, I am currently writing a parser that returns two sequences, as shown below:

<img width="737" alt="Image" src="https://github.com/user-attachments/assets/adbee837-a3d8-425f-b351-0cf11ec84469" />
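
For reference, a minimal sketch of what such a parser could look like, using only the standard library. This is not necessarily what the parser in the screenshot does: the `parse_sentence` name and the `"O"` placeholder label for untagged tokens are assumptions, and the sample is a shortened copy of the excerpt above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the SEMCOR excerpt above.
SAMPLE = """<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>"""

NO_SENSE = "O"  # assumed label for tokens that carry no sense annotation


def parse_sentence(xml_text):
    """Return two aligned sequences (tokens, labels) for one <sentence>."""
    sentence = ET.fromstring(xml_text)
    tokens, labels = [], []
    for child in sentence:
        if child.tag not in ("wf", "instance"):
            continue  # ignore anything that is not a token
        tokens.append(child.text or "")
        if child.tag == "instance":
            # Annotated token: use its instance id as the label.
            labels.append(child.get("id", NO_SENSE))
        else:
            labels.append(NO_SENSE)
    return tokens, labels


tokens, labels = parse_sentence(SAMPLE)
print(tokens)
print(labels)
```

Keeping the two sequences the same length means any tokenization mismatch with our own tokenizer shows up as an alignment problem we can detect early.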

Please let me know your thoughts. In particular, @nihonlanguageprocessing, do you know more about the base SEMCOR task and how current state-of-the-art models solve it?



