feat(io_uring): Add support for registered buffers#72

Open
kavirajk wants to merge 8 commits into
ClickHouse:mainfrom
kavirajk:feat/registered-buffers

@kavirajk kavirajk commented Jun 12, 2026

Contributor

Changes

Added the following new APIs to the scheduler:

  1. registerBuffers()
  2. readFixed()
  3. writeFixed()

These internally use the liburing helpers io_uring_register_buffers(), io_uring_prep_read_fixed(), and io_uring_prep_write_fixed().

This lets us register pre-allocated buffers with io_uring once, so the kernel can reuse them across I/O operations rather than mapping and pinning buffer memory on every I/O.
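
As a rough illustration of what gets registered, here is a minimal sketch of a buffer pool: one page-aligned slab carved into equally sized buffers, described by the `iovec` array that `io_uring_register_buffers()` takes. `BufferPool` and `makeBufferPool` are hypothetical names, not part of this PR, and the 4096-byte alignment is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>
#include <sys/uio.h>  // struct iovec (POSIX)

// Hypothetical helper, not part of the PR: owns one aligned slab and the
// iovec array that describes its equally sized slices.
struct BufferPool
{
    struct FreeDeleter
    {
        void operator()(void * p) const noexcept { std::free(p); }
    };

    std::unique_ptr<void, FreeDeleter> slab;
    std::vector<iovec> iovecs;
};

// Returns an empty pool (no iovecs) if the allocation fails.
inline BufferPool makeBufferPool(std::size_t count, std::size_t buf_size, std::size_t alignment = 4096)
{
    assert(count > 0 && buf_size > 0 && buf_size % alignment == 0);

    void * mem = nullptr;
    if (::posix_memalign(&mem, alignment, count * buf_size) != 0)
        return {};

    BufferPool pool;
    pool.slab.reset(mem);
    pool.iovecs.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        pool.iovecs.push_back(iovec{static_cast<char *>(mem) + i * buf_size, buf_size});
    return pool;
}
```

With liburing, `pool.iovecs.data()` and `pool.iovecs.size()` would be passed once per ring to `io_uring_register_buffers()`, and each later `io_uring_prep_read_fixed()` / `io_uring_prep_write_fixed()` call would name slice `i` through its `buf_index` argument.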

This is mainly based on best practices from the TUM DBMS paper:
https://arxiv.org/pdf/2512.04859

I also integrated this with the file-perf benchmark, and the numbers look promising. See below for the actual improvements.

Performance

My setup is an m8id.8xlarge EC2 instance.

Normal

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file
2026-06-12 21:48:21.567 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 185k     | 721.0 MiB/s | 5.39 µs  | 3.06 µs  | 12.83 µs | 13.57 µs | 22.39 µs |
| 1        | 16       | randwrite  | 575k     | 2245.0 MiB/s | 27.81 µs | 26.82 µs | 36.7 µs  | 41.02 µs | 54.04 µs |
| 16       | 1        | randwrite  | 893k     | 3489.0 MiB/s | 17.89 µs | 18.79 µs | 26.6 µs  | 37.29 µs | 52.2 µs  |
| 16       | 16       | randwrite  | 805k     | 3143.0 MiB/s | 318.16 µs | 262.8 µs | 850.48 µs | 1333.68 µs | 1847.18 µs |
| 1        | 1        | randread   | 232k     | 906.0 MiB/s | 4.29 µs  | 2.42 µs  | 12.35 µs | 12.99 µs | 20.79 µs |
| 1        | 16       | randread   | 682k     | 2665.0 MiB/s | 23.43 µs | 25.55 µs | 29.8 µs  | 38.12 µs | 53.12 µs |
| 16       | 1        | randread   | 2663k    | 10404.0 MiB/s | 5.98 µs  | 3.91 µs  | 15.24 µs | 29.43 µs | 98.45 µs |
| 16       | 16       | randread   | 4955k    | 19355.0 MiB/s | 51.63 µs | 50.92 µs | 80.68 µs | 99.89 µs | 126.67 µs |

Fixed Buffers

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
2026-06-12 21:58:31.006 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 193k     | 754.0 MiB/s | 5.15 µs  | 2.79 µs  | 12.83 µs | 13.65 µs | 22.48 µs |
| 1        | 16       | randwrite  | 608k     | 2373.0 MiB/s | 26.31 µs | 25.26 µs | 34.84 µs | 40.16 µs | 53.23 µs |
| 16       | 1        | randwrite  | 1117k    | 4362.0 MiB/s | 14.3 µs  | 14.89 µs | 23.68 µs | 29.79 µs | 39.66 µs |
| 16       | 16       | randwrite  | 1040k    | 4063.0 MiB/s | 246.11 µs | 222.31 µs | 505.13 µs | 901.61 µs | 1277.51 µs |
| 1        | 1        | randread   | 236k     | 924.0 MiB/s | 4.21 µs  | 2.38 µs  | 12.33 µs | 12.97 µs | 20.76 µs |
| 1        | 16       | randread   | 694k     | 2711.0 MiB/s | 23.03 µs | 25.11 µs | 29.24 µs | 37.22 µs | 52.57 µs |
| 16       | 1        | randread   | 2710k    | 10587.0 MiB/s | 5.88 µs  | 3.8 µs   | 15.16 µs | 28.89 µs | 100.26 µs |
| 16       | 16       | randread   | 5495k    | 21465.0 MiB/s | 46.56 µs | 45.05 µs | 71.7 µs  | 90.96 µs | 120.31 µs |

Throughput diff (normal vs fixed buffers)


| numjobs | iodepth | mode      | IOPS before | IOPS after | IOPS Δ | BW before (MiB/s) | BW after (MiB/s) | BW Δ   |
|---------|---------|-----------|-------------|------------|--------|-------------------|------------------|--------|
| 1       | 1       | randwrite | 185k        | 193k       | +4.3%  | 721.0             | 754.0            | +4.6%  |
| 1       | 16      | randwrite | 575k        | 608k       | +5.7%  | 2245.0            | 2373.0           | +5.7%  |
| 16      | 1       | randwrite | 893k        | 1117k      | +25.1% | 3489.0            | 4362.0           | +25.0% |
| 16      | 16      | randwrite | 805k        | 1040k      | +29.2% | 3143.0            | 4063.0           | +29.3% |
| 1       | 1       | randread  | 232k        | 236k       | +1.7%  | 906.0             | 924.0            | +2.0%  |
| 1       | 16      | randread  | 682k        | 694k       | +1.8%  | 2665.0            | 2711.0           | +1.7%  |
| 16      | 1       | randread  | 2663k       | 2710k      | +1.8%  | 10404.0           | 10587.0          | +1.8%  |
| 16      | 16      | randread  | 4955k       | 5495k      | +10.9% | 19355.0           | 21465.0          | +10.9% |
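
For anyone re-deriving the Δ columns, a minimal sketch of the arithmetic follows. `percentDelta` and `bandwidthMiB` are hypothetical helper names, and the 4096-byte block size is taken from `bs=4k` in the benchmark output. For example, 893k to 1117k IOPS gives +25.1%, and 185,000 IOPS at 4 KiB works out to about 722.7 MiB/s, close to the reported 721.0 (the small gap is presumably because the table rounds IOPS to the nearest thousand).

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent, rounded to one decimal place.
inline double percentDelta(double before, double after)
{
    assert(before != 0.0);
    return std::round((after - before) / before * 1000.0) / 10.0;
}

// Bandwidth in MiB/s implied by an IOPS figure at a given block size in bytes.
inline double bandwidthMiB(double iops, double block_size_bytes)
{
    return iops * block_size_bytes / (1024.0 * 1024.0);
}
```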



Latency diff in µs (normal vs fixed buffers)

| numjobs | iodepth | mode      | avg                     | p50                     | p95                      | p99                       | p99.9                     |
|---------|---------|-----------|-------------------------|-------------------------|--------------------------|---------------------------|---------------------------|
| 1       | 1       | randwrite | 5.39→5.15 (−4.5%)       | 3.06→2.79 (−8.8%)       | 12.83→12.83 (0.0%)       | 13.57→13.65 (+0.6%)       | 22.39→22.48 (+0.4%)       |
| 1       | 16      | randwrite | 27.81→26.31 (−5.4%)     | 26.82→25.26 (−5.8%)     | 36.7→34.84 (−5.1%)       | 41.02→40.16 (−2.1%)       | 54.04→53.23 (−1.5%)       |
| 16      | 1       | randwrite | 17.89→14.3 (−20.1%)     | 18.79→14.89 (−20.8%)    | 26.6→23.68 (−11.0%)      | 37.29→29.79 (−20.1%)      | 52.2→39.66 (−24.0%)       |
| 16      | 16      | randwrite | 318.16→246.11 (−22.6%)  | 262.8→222.31 (−15.4%)   | 850.48→505.13 (−40.6%)   | 1333.68→901.61 (−32.4%)   | 1847.18→1277.51 (−30.8%)  |
| 1       | 1       | randread  | 4.29→4.21 (−1.9%)       | 2.42→2.38 (−1.7%)       | 12.35→12.33 (−0.2%)      | 12.99→12.97 (−0.2%)       | 20.79→20.76 (−0.1%)       |
| 1       | 16      | randread  | 23.43→23.03 (−1.7%)     | 25.55→25.11 (−1.7%)     | 29.8→29.24 (−1.9%)       | 38.12→37.22 (−2.4%)       | 53.12→52.57 (−1.0%)       |
| 16      | 1       | randread  | 5.98→5.88 (−1.7%)       | 3.91→3.8 (−2.8%)        | 15.24→15.16 (−0.5%)      | 29.43→28.89 (−1.8%)       | 98.45→100.26 (+1.8%)      |
| 16      | 16      | randread  | 51.63→46.56 (−9.8%)     | 50.92→45.05 (−11.5%)    | 80.68→71.7 (−11.1%)      | 99.89→90.96 (−8.9%)       | 126.67→120.31 (−5.0%)     |
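
The p50/p95/p99/p99.9 columns are latency percentiles. The PR does not show how file-perf computes them, so the following is only a sketch of one common definition (nearest-rank) over a vector of per-I/O latency samples; `percentile` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Takes the samples by value
// because it sorts them.
inline double percentile(std::vector<double> samples, double p)
{
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const auto rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    return samples[std::max<std::size_t>(rank, 1) - 1];
}
```

Under this definition a tail percentile such as p99.9 is decided by a handful of the slowest samples, which would explain why it moves less predictably between runs than the average.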

kavirajk added 3 commits June 12, 2026 22:03
Add support for new APIs to the scheduler

1. register_buffers
2. read_fixed()
3. write_fixed()

This lets us register the pre-allocated buffers that io_uring can use
during I/O operations rather than allocating them per I/O.

This is mainly based on best practices learned from TUM DBMS paper
https://arxiv.org/pdf/2512.04859

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Add a flag to run file-perf with the io_uring registered-buffer API

```
./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
```

The numbers look very interesting, so this is worth adding upstream.
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
`./bb fmt`

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

@vadimskipin vadimskipin left a comment

Collaborator

Cool! I thought about this optimization but have not tried it. The results are really impressive.

Copilot AI left a comment

Pull request overview

Adds io_uring “fixed/registered buffer” support to FiberScheduler and integrates it into the file-perf benchmark to avoid per-IO buffer pinning/allocation overhead.

Changes:

  • Added FiberScheduler::registerBuffers(), readFixed(), and writeFixed() APIs backed by liburing helpers.
  • Updated file-perf to optionally register per-job buffers and issue IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED via a new --fixed-buffers flag.
  • Extended the bb perf runner to pass through --fixed-buffers for file-perf runs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| src/perf/file-perf.cpp | Adds a --fixed-buffers mode that registers per-job buffers and switches IO submission to fixed-buffer ops. |
| src/fibers/fiber.cpp | Implements readFixed, writeFixed, and registerBuffers on the scheduler’s per-CPU rings. |
| include/silk/fibers/fiber.h | Exposes and documents the new fixed-buffer APIs on the public scheduler interface. |
| bb | Adds CLI plumbing to enable fixed-buffer mode for file-perf via bb perf and bb file-perf. |

Comment thread src/fibers/fiber.cpp
  {
      continue;
  }
  int r = ::io_uring_register_buffers(&processor->ring, iovecs, count);

Collaborator

What about NUMA awareness here? Allocating once and using the buffers from any CPU does not look optimal. Should we maintain separate buffers per node (or per CPU)?

Contributor Author

Fair point 👍 I think this may require changes to all three new APIs (read_fixed/write_fixed/register_buffers). Currently the buffers can physically live on one node, and cores on other nodes have to pay the remote-memory cost.

I'm thinking of a simple struct like this:

  struct FixedBuf
  {
      void *   base[SILK_MAX_NUMA_NODES];  // node-local pinned bases
      uint32_t index;                      // node-relative index
      uint32_t len;
  };

and making the read_fixed and write_fixed APIs accept this FixedBuf along with an offset, instead of a plain void * pointer.

What do you think? Maybe it's too complex? I'm open to other ideas if you have a simpler approach (I'm not super familiar with NUMA in general :) )
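
To make the proposal concrete, here is a minimal sketch of how such a struct could be resolved to a node-local address before issuing a fixed read or write. It is an illustration only: the value of `SILK_MAX_NUMA_NODES` is assumed (the thread does not give it), and `resolve` is a hypothetical helper, not an API in silk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed upper bound; the real value of SILK_MAX_NUMA_NODES is not shown in the thread.
constexpr std::size_t SILK_MAX_NUMA_NODES = 8;

struct FixedBuf
{
    void *        base[SILK_MAX_NUMA_NODES];  // node-local pinned bases
    std::uint32_t index;                      // node-relative index
    std::uint32_t len;
};

// Hypothetical helper: address to hand to a fixed read/write issued from `node`,
// or nullptr if the range does not fit in the buffer or the node has no copy.
inline void * resolve(const FixedBuf & buf, std::size_t node, std::uint32_t offset, std::uint32_t nbytes)
{
    assert(node < SILK_MAX_NUMA_NODES);
    if (offset > buf.len || nbytes > buf.len - offset || buf.base[node] == nullptr)
        return nullptr;
    return static_cast<char *>(buf.base[node]) + offset;
}
```

The caller would pass the resolved pointer together with `buf.index` as the buffer index, so the same logical buffer maps to different physical memory depending on which node the submitting core belongs to.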

Collaborator

It seems silk just needs to expose a raw register-buffer API. It would be better to write the client code first and then decide what can be pushed into silk.

kavirajk added 4 commits June 14, 2026 21:20
Changes
1. Make sure the readFixed API on the registered buffer is checked by
   MSan for uninitialized memory (similar to the readv API)
2. Fix the nbytes len field (uint64_t -> uint32_t) because that is what
   the underlying io_uring_* API expects
3. Add a round-trip test for the new API

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
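
Since `io_uring_prep_read_fixed()` and `io_uring_prep_write_fixed()` take `nbytes` as an `unsigned`, a caller that still tracks sizes in 64 bits has to narrow them at the call site. A minimal sketch of a checked narrowing follows; `toFixedLen` is a hypothetical helper, not something this PR adds.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: narrow a 64-bit length to the 32-bit width used for
// nbytes, refusing values that would otherwise be silently truncated.
inline std::optional<std::uint32_t> toFixedLen(std::uint64_t nbytes)
{
    if (nbytes > std::numeric_limits<std::uint32_t>::max())
        return std::nullopt;
    return static_cast<std::uint32_t>(nbytes);
}
```

A request larger than 4 GiB minus one byte would then surface as an explicit error, or be split into several I/Os, instead of wrapping around.
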
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Document the new APIs in the corresponding docs

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk kavirajk marked this pull request as ready for review June 14, 2026 23:03
@kavirajk kavirajk requested a review from vadimskipin June 14, 2026 23:03