[draft] data ingestion module #4

@clay-arras

Description

TODO

Data ingestion needs to be handled with a generator.
We can do this by chunking the data into groups of 2^20 tokens (i.e. 2^20 numbers, about 1 MiB at one byte per token). In the counting phase, we'll need to load each of these chunks exactly once; in the writing phase, we may need to load some chunks more than once, but the total number of loads stays bounded by O(N).

Due to the data splits via the regex, we may need to store some metadata in each of the chunks.
Idea: the format will be (length of chunk, chunk data...),
e.g. `3 1 5 4` (3 is the length; 1, 5, 4 are the tokens).
