Small improvements to readahead #6821

@m09526

Description

To make S3 readahead work more effectively in the DataFusion code paths, we want to reduce the number of simultaneous S3 connections it keeps open. This should significantly improve scalability and reduce compaction failures caused by the issues described in #6750 and #5777.

This issue is to improve the following in the readahead code:

  • Increase the maximum default allowable "readahead" distance before a stream is closed and re-opened. The current default of 64KiB is somewhat arbitrary and too low.
  • Replace use of shared pointer and mutex for metric logging with a simple atomic counter. This is a code hygiene issue and will not materially affect functionality.
  • When performing a readahead GET request, purge the cache BEFORE adding the new item to it. This makes the maximum number of live streams in the cache a hard limit, rather than a hint.
  • When purging streams from the cache for a given file, switch to a "least recently used" algorithm, instead of the current "earliest file position" approach.
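The first bullet above amounts to raising a distance threshold: a forward seek within the readahead limit can be served by skip-reading on the open stream, while anything beyond it forces a close and reopen. A minimal Rust sketch of that decision follows; `should_reuse` and the 1 MiB limit are illustrative assumptions, not the project's actual names or chosen default.

```rust
// Hypothetical readahead decision: reuse an already-open stream when the
// requested offset is within READAHEAD_LIMIT bytes ahead of its current
// position; otherwise close and reopen. 1 MiB is an assumed example value
// (the issue only says the current 64 KiB default is too low).
const READAHEAD_LIMIT: u64 = 1024 * 1024;

fn should_reuse(current_pos: u64, requested: u64) -> bool {
    // Only forward seeks within the limit can be served by skip-reading.
    requested >= current_pos && requested - current_pos <= READAHEAD_LIMIT
}

fn main() {
    println!("{}", should_reuse(0, 64 * 1024));       // forward seek within limit
    println!("{}", should_reuse(0, 2 * 1024 * 1024)); // too far ahead: reopen
}
```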
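For the metric-logging bullet, the shared-pointer-plus-mutex pattern can be replaced by an atomic counter shared behind the same pointer, removing lock traffic from the read path. A sketch under assumed names (`ReadaheadMetrics`, `bytes_skipped` are hypothetical, not the real code's identifiers):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical metric holder: an Arc<Mutex<u64>> counter becomes an
// Arc<AtomicU64>, so recording a metric never blocks.
#[derive(Default)]
struct ReadaheadMetrics {
    bytes_skipped: AtomicU64,
}

impl ReadaheadMetrics {
    fn record_skip(&self, n: u64) {
        // Relaxed ordering suffices for a monotonic counter that is
        // only read back for reporting.
        self.bytes_skipped.fetch_add(n, Ordering::Relaxed);
    }

    fn total(&self) -> u64 {
        self.bytes_skipped.load(Ordering::Relaxed)
    }
}

fn main() {
    let metrics = Arc::new(ReadaheadMetrics::default());
    metrics.record_skip(64 * 1024);
    metrics.record_skip(1024);
    println!("{}", metrics.total());
}
```

As the issue notes, this is hygiene rather than a functional change: the counter values are the same, but contention and the extra mutex disappear.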
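The last two bullets interact: purging before insertion makes the cache size a hard cap, and choosing the victim by last use rather than earliest file position keeps hot streams alive. A combined sketch, assuming a toy `StreamCache` with a logical-tick LRU (none of these names come from the actual readahead code):

```rust
use std::collections::HashMap;

// Hypothetical cached stream state: its current byte offset and a
// logical tick recording when it was last accessed.
struct CachedStream {
    position: u64,
    last_used: u64,
}

struct StreamCache {
    max_streams: usize,
    tick: u64,
    streams: HashMap<u64, CachedStream>, // keyed by an assumed stream id
}

impl StreamCache {
    fn new(max_streams: usize) -> Self {
        Self { max_streams, tick: 0, streams: HashMap::new() }
    }

    fn touch(&mut self, id: u64) {
        self.tick += 1;
        if let Some(s) = self.streams.get_mut(&id) {
            s.last_used = self.tick;
        }
    }

    // Purge BEFORE inserting, so max_streams is a hard limit, not a hint.
    fn insert(&mut self, id: u64, position: u64) {
        while self.streams.len() >= self.max_streams {
            // LRU victim: the entry with the smallest last_used tick,
            // regardless of its file position.
            let victim = *self
                .streams
                .iter()
                .min_by_key(|(_, s)| s.last_used)
                .map(|(id, _)| id)
                .unwrap();
            self.streams.remove(&victim);
        }
        self.tick += 1;
        self.streams
            .insert(id, CachedStream { position, last_used: self.tick });
    }
}

fn main() {
    let mut cache = StreamCache::new(2);
    cache.insert(1, 0);
    cache.insert(2, 4096);
    cache.touch(1); // stream 1 is now most recently used
    cache.insert(3, 8192); // evicts stream 2 (LRU), although 1 has the earliest position
    let mut ids: Vec<u64> = cache.streams.keys().copied().collect();
    ids.sort();
    println!("{:?}", ids);
}
```

Under the old "earliest file position" rule, stream 1 would have been evicted here even though it was just used; LRU keeps it and drops the idle stream instead.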

Split from:
