[draft] data ingestion module #4

@clay-arras

Description

TODO

Data ingestion needs to be handled with a generator.
We can do this by chunking the data into groups of 2^20 tokens (i.e. 2^20 numbers, about 1 MiB at one byte per token). In the counting phase, we'll need to load each of these chunks exactly once; in the writing phase, we may need to load some chunks more than once, but the total number of loads stays bounded by O(N).

Due to the data splits via the regex, we may need to store some metadata in each of the chunks.
Idea: the format will be (length of chunk, chunk data...),
e.g. `3 1 5 4` (3 is the length; 1, 5, 4 are the tokens).
