Allow pretokenization interface

In its current state, `fast_sage` processes an input string in a single step: it converts the string to bytes, then tokenizes it based on the initial vocab. The only way to enforce pretoken boundaries is through this initial vocab, but that can't be done in non-trivial cases like affix splitting (because there's no way to stop existing tokens from crossing an affix boundary) or MWE (multi-word expression) inclusion (because multi-word initial tokens can be subsumed in larger sequences).

A good way to solve this would be to overload `SageTokenizer.tokenize()` with a version that accepts a list of pretokens and enforces their boundaries when building the lattice.

Example (affix splitting):
- pretok rule: "always separate `n't` at the end of a word"
- init vocab includes the following: `_is` `n't` `'t` `_i` `sn` `this` `_good` `.`
- input sentence: `this_isn't_good.`
- current lattice includes the path `_i => sn => 't`
- desired input: `["this", "_is", "n't", "_good", "."]`
- desired lattice does not include a path containing the illegal cross-boundary `sn` token.
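
To make the example concrete, here is a minimal sketch of boundary-aware lattice construction. It is not `fast_sage` code: `build_lattice` and its `pretokens` argument are hypothetical names, and it matches over characters rather than bytes (the two coincide for this ASCII input). It only illustrates the idea that an edge is admitted into the lattice when it lies entirely inside one pretoken.

```python
from itertools import accumulate


def build_lattice(text, vocab, pretokens=None):
    """Return (start, end, token) edges for every vocab match in `text`.

    Hypothetical helper, not part of fast_sage. When `pretokens` is given,
    an edge is kept only if it lies entirely inside a single pretoken.
    """
    if pretokens is None:
        # No boundaries: the whole string is one span.
        spans = [(0, len(text))]
    else:
        if "".join(pretokens) != text:
            raise ValueError("pretokens must concatenate to the input text")
        ends = list(accumulate(len(p) for p in pretokens))
        starts = [0] + ends[:-1]
        spans = list(zip(starts, ends))

    edges = []
    for lo, hi in spans:
        # Enumerate substrings that start and end within the same span.
        for start in range(lo, hi):
            for end in range(start + 1, hi + 1):
                piece = text[start:end]
                if piece in vocab:
                    edges.append((start, end, piece))
    return edges


vocab = {"_is", "n't", "'t", "_i", "sn", "this", "_good", "."}
text = "this_isn't_good."

free = build_lattice(text, vocab)
bounded = build_lattice(text, vocab, ["this", "_is", "n't", "_good", "."])
```

Without pretokens, `free` contains the cross-boundary `sn` edge; with them, `bounded` keeps `_i`, `_is`, `n't` and `'t` but drops `sn`, so no path of the form `_i => sn => 't` can be formed.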

Allow pretokenization interface #13

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Allow pretokenization interface #13

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions