In its current state, fast_sage applies a single step when it accepts a string: it converts the string to bytes, then tokenizes based on the initial vocab. The only way to enforce pretoken boundaries is through this initial vocab, but that can't be done in non-trivial cases such as affix splitting (because there's no way to stop existing tokens from crossing an affix boundary) or MWE inclusion (because multi-word initial tokens can be subsumed in larger sequences).
A good way to solve this would be to overload SageTokenizer.tokenize() with a version that accepts a list of pretokens and enforces their boundaries when creating a lattice.
Example (affix splitting):
- pretok rule: "always separate n't at the end of a word"
- init vocab includes the following:
_is n't 't _i sn this _good .
- input sentence:
this_isn't_good.
- current lattice includes the path
_i => sn => 't
- desired input:
['this', '_is', "n't", "_good", "."]
- desired lattice does not include a path containing the illegal cross-boundary sn token.
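
To make the request concrete, here is a minimal sketch of boundary enforcement at lattice-construction time. It is not fast_sage's implementation: build_lattice and full_paths are hypothetical helpers, the lattice is a plain list of (start, end, token) edges, and it works on str rather than bytes to keep the example short. The only idea it illustrates is that edges are enumerated inside each pretoken, so no edge can span a pretoken boundary.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Edge = Tuple[int, int, str]  # (start offset, end offset, token)


def build_lattice(text: str, vocab: Set[str],
                  pretokens: Optional[List[str]] = None) -> List[Edge]:
    """Hypothetical helper: list every vocab match as a lattice edge.

    Without pretokens the whole text is treated as one unit (current
    behaviour as described above). With pretokens, matches are searched
    inside each pretoken only, so no edge crosses a boundary.
    """
    if pretokens is None:
        pretokens = [text]
    if "".join(pretokens) != text:
        raise ValueError("pretokens must concatenate to the input text")

    max_len = max(map(len, vocab), default=0)
    edges: List[Edge] = []
    offset = 0  # position of the current pretoken within the full text
    for pretoken in pretokens:
        for start in range(len(pretoken)):
            last = min(len(pretoken), start + max_len)
            for end in range(start + 1, last + 1):
                piece = pretoken[start:end]
                if piece in vocab:
                    edges.append((offset + start, offset + end, piece))
        offset += len(pretoken)
    return edges


def full_paths(edges: List[Edge], length: int) -> List[List[str]]:
    """Hypothetical helper: all token sequences covering [0, length)."""
    by_start: Dict[int, List[Tuple[int, str]]] = {}
    for start, end, token in edges:
        by_start.setdefault(start, []).append((end, token))

    def walk(pos: int) -> Iterator[List[str]]:
        if pos == length:
            yield []
            return
        for end, token in by_start.get(pos, []):
            for rest in walk(end):
                yield [token] + rest

    return list(walk(0))


# The affix-splitting example from this issue.
vocab = {"_is", "n't", "'t", "_i", "sn", "this", "_good", "."}
text = "this_isn't_good."
pretokens = ["this", "_is", "n't", "_good", "."]

free = build_lattice(text, vocab)
bounded = build_lattice(text, vocab, pretokens)

free_paths = full_paths(free, len(text))
bounded_paths = full_paths(bounded, len(text))
```

In this sketch, free_paths contains the path through _i => sn => 't, while bounded_paths does not, because the sn edge would have to start inside _is and end inside n't. A real implementation would need to apply the same restriction to byte offsets and to whatever lattice structure fast_sage uses internally.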