Description
Currently, searchkit parallelizes work at the file level, so the smallest unit of work is a single file. This approach does not scale well when the file-size distribution is uneven. Consider the following log files:
- a.log (4 KiB)
- b.log (4 GiB)
- c.log (17 KiB)
... and assume, for simplicity, that we have 3 execution units (processes, threads, etc.). The total amount of data is 4 GiB + 21 KiB, but the per-unit load will be 4 KiB, 4 GiB, and 17 KiB, which is heavily unbalanced. Ideally, each execution unit should process roughly 4/3 GiB of data so that the load is evenly distributed.
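The imbalance can be illustrated with a small sketch (hypothetical, not searchkit code) that greedily assigns whole files to the least-loaded of 3 execution units, as the current file-level scheduling effectively does:

```python
KIB = 1024
GIB = 1024 ** 3

# The example files from above.
files = {"a.log": 4 * KIB, "b.log": 4 * GIB, "c.log": 17 * KIB}

units = [0, 0, 0]  # bytes assigned to each execution unit
for size in sorted(files.values(), reverse=True):
    # Assign each file, whole, to the currently least-loaded unit.
    units[units.index(min(units))] += size

print(units)  # one unit gets 4 GiB, the others 17 KiB and 4 KiB
print(max(units) / (sum(units) / len(units)))  # ~3x the ideal per-unit load
```

Even with an optimal whole-file assignment, the busiest unit carries nearly 3 times the ideal load, because the unit of work cannot be smaller than the largest file.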
This could be achieved without major architectural changes by making a large file appear as multiple smaller files (pseudofiles). The split points would be aligned to line-feed boundaries, so no access across pseudofile boundaries would be necessary.
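A minimal sketch of the splitting step might look like the following. The function name `split_points` and its interface are illustrative assumptions, not part of searchkit's API; the key idea is that each tentative chunk boundary is advanced to the next line feed so every pseudofile contains only whole lines:

```python
import os


def split_points(path, num_chunks):
    """Compute (start, end) byte ranges that partition the file at
    `path` into ~num_chunks pseudofiles, each ending on a line feed
    (or at EOF), so no line straddles a pseudofile boundary."""
    size = os.path.getsize(path)
    target = max(1, size // num_chunks)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            end = min(start + target, size)
            if end < size:
                # Advance the tentative boundary to the next
                # line-feed marker.
                f.seek(end)
                f.readline()
                end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges
```

Each returned range could then be handed to an execution unit as an independent pseudofile, keeping the existing per-file work loop intact.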
Note that this might hurt performance for gzip-compressed files, since gzip requires decompressing the stream from the beginning for every file seek operation.