Allow pretokenization interface #13

Description

@yuvalpinter

In its current state, fast_sage performs a single step when it accepts a string: it converts the string to bytes, then tokenizes based on the initial vocab. The only way to enforce pretoken boundaries is through this initial vocab, but that doesn't work in non-trivial cases such as affix splitting (because there's no way to stop existing tokens from crossing an affix boundary) or MWE inclusion (because multi-word initial tokens can be subsumed in larger sequences).

A good way to solve this would be to overload SageTokenizer.tokenize() with a version that accepts a list of pretokens and enforces their boundaries when creating the lattice.

Example (affix splitting):

  • pretok rule: "always separate n't at the end of a word"
  • init vocab includes the following: _is n't 't _i sn this _good .
  • input sentence: this_isn't_good.
  • current lattice includes the path _i => sn => 't
  • desired input: ['this', '_is', "n't", "_good", "."]
  • desired lattice does not include a path containing the illegal cross-boundary sn token.
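
To make the example concrete, here is a minimal sketch of the idea, not fast_sage code. The function names `build_lattice` and `build_lattice_pretokenized` are hypothetical, the lattice is represented as a plain list of `(start, end, token)` edges, and matching is done on characters rather than bytes to keep it short. The only assumption about the fix is that each pretoken is matched on its own, so no edge can span a pretoken boundary.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]  # (start offset, end offset, token)


def build_lattice(text: str, vocab: Set[str]) -> List[Edge]:
    """Collect every vocab token that matches some span of `text`."""
    max_len = max(map(len, vocab))
    edges: List[Edge] = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in vocab:
                edges.append((start, end, text[start:end]))
    return edges


def build_lattice_pretokenized(pretokens: List[str], vocab: Set[str]) -> List[Edge]:
    """Build the lattice one pretoken at a time.

    Edges are shifted to offsets in the concatenated string, but since each
    pretoken is matched in isolation, no edge can cross a pretoken boundary.
    """
    edges: List[Edge] = []
    offset = 0
    for pretoken in pretokens:
        for start, end, token in build_lattice(pretoken, vocab):
            edges.append((offset + start, offset + end, token))
        offset += len(pretoken)
    return edges


# Vocab and input taken from the example above.
vocab = {"_is", "n't", "'t", "_i", "sn", "this", "_good", "."}
sentence = "this_isn't_good."
pretokens = ["this", "_is", "n't", "_good", "."]

unconstrained = build_lattice(sentence, vocab)
constrained = build_lattice_pretokenized(pretokens, vocab)

# The cross-boundary token "sn" is present only in the unconstrained lattice.
print(any(token == "sn" for _, _, token in unconstrained))
print(any(token == "sn" for _, _, token in constrained))
```

With the pretoken list as input, the `sn` edge disappears, while `_is`, `n't`, and `'t` remain available inside their own pretokens.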
