Sub file-level parallelization #13

Currently, searchkit parallelizes work at the file level, so the smallest unit of work is a single file. This approach does not scale well when the file size distribution is uneven. Consider the following log files:

- a.log (4 KiB)
- b.log (4 GiB)
- c.log (17 KiB)

For simplicity, assume we have 3 execution units (processes, threads, etc.). The total amount of data is 4 GiB + 21 KiB, but the per-unit load will be 4 KiB, 4 GiB, and 17 KiB, which is heavily unbalanced: one unit does almost all of the work. Ideally, each execution unit would process roughly 4/3 GiB (about 1.33 GiB) of data, so that the load is evenly balanced.
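To make the imbalance concrete, here is a small back-of-the-envelope calculation using the three example files. The variable names are only illustrative and are not part of searchkit.

```python
KIB = 1024
GIB = 1024 ** 3

# File sizes from the example above.
sizes = {"a.log": 4 * KIB, "b.log": 4 * GIB, "c.log": 17 * KIB}
units = 3

total = sum(sizes.values())      # 4 GiB + 21 KiB, in bytes
ideal = total / units            # bytes per unit if the load were even
busiest = max(sizes.values())    # with file-level work units, one unit gets b.log

# How many times more data the busiest unit handles compared to the ideal share.
imbalance = busiest / ideal
print(f"ideal share: {ideal / GIB:.2f} GiB, busiest unit: {busiest / GIB:.2f} GiB")
print(f"imbalance factor: {imbalance:.2f}")
```

With file-level work units the busiest unit ends up with close to three times its ideal share, while the other two are idle almost immediately.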

This could be achieved without major architectural changes if we could make a large file appear as multiple smaller files (pseudofiles). The actual splitting would be done at line-feed boundaries, so access across pseudofile boundaries would not be necessary.
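A minimal sketch of how the split points could be computed, assuming plain (uncompressed) files with `\n` line endings. The function name `split_on_line_boundaries` and the `(start, end)` byte-range representation of a pseudofile are hypothetical and not part of searchkit's current API.

```python
import os


def split_on_line_boundaries(path, num_chunks):
    """Return contiguous (start, end) byte ranges covering the file.

    Every range ends right after a line feed (or at end of file), so no
    line is ever shared between two ranges. Hypothetical helper.
    """
    size = os.path.getsize(path)
    if size == 0 or num_chunks < 1:
        return []

    target = max(1, size // num_chunks)  # approximate bytes per range
    ranges = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            end = min(start + target, size)
            if end < size:
                # Step back one byte so that a cut already sitting right
                # after a line feed is kept as-is, then extend the range
                # to the end of the current line.
                f.seek(end - 1)
                f.readline()
                end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges
```

Each range could then be handed to an execution unit as if it were a separate file. Because cuts are moved forward to the next line feed, ranges are only approximately equal in size, and the number of ranges can differ slightly from `num_chunks` (for example, a file consisting of one very long line yields a single range).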

Note that this might hurt performance for gzip-compressed files, since a gzip stream does not support random access: seeking to an offset requires decompressing the data that precedes it.
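The cost can be illustrated with Python's standard `gzip` module. This is only a sketch; `read_gzip_range` is a hypothetical helper, and the offsets refer to positions in the uncompressed data.

```python
import gzip


def read_gzip_range(path, start, end):
    """Read uncompressed bytes [start, end) from a gzip file.

    Hypothetical helper. The seek below is not a cheap jump: the gzip
    module has to decompress and discard everything before `start`.
    """
    with gzip.open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

If every execution unit opens the same gzip file and seeks to its own pseudofile, the leading part of the file is decompressed repeatedly, so the total decompression work grows with the number of pseudofiles instead of being shared among them.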
