Open
Description
TODO
data ingestion needs to be handled with a generator.
we can do this by chunking the data into groups of 2^20 tokens (i.e. 2^20 numbers; that is 1 MiB only if each token is stored as a single byte). in the counting phase, we'll need to load each of these chunks exactly once; in the writing phase, we may need to load more than that, but the total is still bounded by O(N).
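a minimal sketch of the chunking generator described above, assuming the tokens arrive as any iterable of ints. the names `iter_chunks` and `CHUNK_SIZE` are made up for illustration and are not from the codebase:

```python
from itertools import islice
from typing import Iterable, Iterator, List

CHUNK_SIZE = 2 ** 20  # tokens per chunk, as proposed in this issue


def iter_chunks(tokens: Iterable[int], chunk_size: int = CHUNK_SIZE) -> Iterator[List[int]]:
    """Lazily yield lists of at most `chunk_size` tokens (hypothetical helper)."""
    it = iter(tokens)
    while True:
        # Pull the next chunk_size tokens; only this chunk is held in memory.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk
```

the last chunk may be shorter than `chunk_size`. a counting pass would then be a single `for chunk in iter_chunks(...)` loop, which touches each chunk once.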
because the regex splits the data, we may need to store some metadata in each chunk.
idea: the format will be (length of chunk, chunk data...)
e.g. 3 1 5 4 (3 is the length; 1 5 4 are the tokens)
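a rough sketch of the length-prefixed format, assuming the chunks are written back to back as a flat sequence of ints. `encode_chunk` and `decode_chunks` are hypothetical names used only for illustration:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def encode_chunk(tokens: List[int]) -> List[int]:
    """Prefix a chunk with its length: [1, 5, 4] becomes [3, 1, 5, 4]."""
    return [len(tokens), *tokens]


def decode_chunks(stream: Iterable[int]) -> Iterator[List[int]]:
    """Read back-to-back (length, tokens...) records from a flat int stream."""
    it = iter(stream)
    for length in it:
        # The first number of each record says how many tokens follow.
        chunk = list(islice(it, length))
        if len(chunk) != length:
            raise ValueError("stream ended in the middle of a chunk")
        yield chunk
```

with this layout a reader can recover chunk boundaries without any separate index, since each record states its own length.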