Open
Description
TODO
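use the builtin C++ regex library (std::regex from <regex>). the regex pattern will probably just be borrowed from a good open-source model. caveat: std::regex has no support for Unicode property classes such as \p{L}, so a borrowed pattern would need adapting first.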
example: for deepseek, https://huggingface.co/deepseek-ai/DeepSeek-R1/raw/main/tokenizer.json
here is the pattern, I believe (JSON-escaped, as it appears in tokenizer.json): "[!\"#$%&'()*+,\\-./:;<=>?@\\[\\\\\\]^_`{|}~][A-Za-z]+|[^\r\n\\p{L}\\p{P}\\p{S}]?[\\p{L}\\p{M}]+| ?[\\p{P}\\p{S}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
chunking:
- split the text into chunks of 1024 characters
- process the chunks iteratively; each chunk is split according to the regex pattern
- carry over the tail of the previous iteration into the next one: the last split piece, capped at 512 characters, i.e. min(lastSplitChunk, 512)
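a rough sketch of the chunking steps above, using std::regex. everything named here is hypothetical: `splitByPattern`, `preTokenize`, and `kAsciiPattern` are made up for illustration. the pattern is a simplified ASCII-only stand-in, not the DeepSeek one. it also assumes "characters" means bytes of a std::string, and that when the last split piece is longer than the cap, only its last 512 bytes are carried and the rest is emitted right away.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical ASCII-only stand-in for the real pattern (no \p{...} classes).
const char* const kAsciiPattern =
    R"( ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)";

// Split `buf` into consecutive pieces: every regex match plus any unmatched
// gap between matches, so concatenating the pieces reproduces `buf`.
std::vector<std::string> splitByPattern(const std::string& buf, const std::regex& re) {
    std::vector<std::string> pieces;
    std::size_t last = 0;
    for (std::sregex_iterator it(buf.begin(), buf.end(), re), end; it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position());
        const std::size_t len = static_cast<std::size_t>(it->length());
        if (len == 0) continue;  // ignore empty matches
        if (start > last) pieces.push_back(buf.substr(last, start - last));
        pieces.push_back(it->str());
        last = start + len;
    }
    if (last < buf.size()) pieces.push_back(buf.substr(last));
    return pieces;
}

// Process `text` in chunks of `chunkSize` bytes (must be > 0). The last piece
// of each non-final chunk may be cut mid-token, so it is held back (at most
// `maxCarry` bytes of it) and prepended to the next chunk.
std::vector<std::string> preTokenize(const std::string& text, const std::regex& re,
                                     std::size_t chunkSize = 1024,
                                     std::size_t maxCarry = 512) {
    std::vector<std::string> out;
    std::string carry;
    for (std::size_t pos = 0; pos < text.size(); pos += chunkSize) {
        const std::string buf = carry + text.substr(pos, chunkSize);
        carry.clear();
        std::vector<std::string> pieces = splitByPattern(buf, re);
        const bool isLastChunk = pos + chunkSize >= text.size();
        if (!isLastChunk && !pieces.empty()) {
            std::string tail = std::move(pieces.back());
            pieces.pop_back();
            const std::size_t keep = std::min(tail.size(), maxCarry);
            // Assumption: if the tail exceeds the cap, emit its prefix now.
            if (keep < tail.size()) pieces.push_back(tail.substr(0, tail.size() - keep));
            carry = tail.substr(tail.size() - keep);
        }
        out.insert(out.end(), pieces.begin(), pieces.end());
    }
    return out;
}
```

calling `preTokenize(text, std::regex(kAsciiPattern))` uses the 1024/512 sizes from the list above. the carry-over keeps a token that straddles a chunk boundary from being split in two, as long as it fits within the cap. splitting on raw bytes can also cut a multi-byte UTF-8 character in half, so real Unicode input would need boundary handling that this sketch does not attempt.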