
datapath: cache optimization for 12% throughput improvement #495

Open

rjarry wants to merge 7 commits into DPDK:main from rjarry:optimizations

Conversation

rjarry (Collaborator) commented Feb 3, 2026

This series reduces L1 and LLC cache misses in the datapath hot paths, achieving a 12% throughput improvement (11.6M to 13.0M pkt/s) in bi-directional forwarding benchmarks.

The first two commits are minor cleanups that prepare for the optimizations. The third commit replaces the linked list of interface types with a static array, eliminating pointer chasing in iface_type_get(), which is called from the eth_input hot path. This alone reduces LLC cache misses by 9% and increases throughput by 1.7%.

The fourth commit batches interface statistics updates. Instead of writing to the shared iface_stats structure for every packet, counters are accumulated locally and flushed only when the interface changes or at batch end. This reduces LLC cache misses by another 11%.

The fifth commit moves RCU quiescent state reporting from every graph walk iteration to the housekeeping interval that runs every 256 iterations. This reduces memory barrier frequency and cuts cache misses by 7%.

The final and most impactful commit replaces per-packet rte_node_enqueue_x1() calls with a batching scheme. For homogeneous traffic where all packets go to the same edge, the destination node structure is now touched once per batch instead of once per packet. This increases throughput by 10.2% and reduces cycles per packet in output nodes by up to 37%.

Detailed analysis with perf annotate observations and per-node cycle counts

Optimization Benchmark Analysis

This document presents the performance analysis of the patch series on the
optimizations branch targeting L1 and LLC cache miss reduction.

Test Environment

Each benchmark was run with:

  • perf record -g -e L1-dcache-load-misses,LLC-load-misses --call-graph dwarf -C 10 -- sleep 10
  • Bi-directional traffic at maximum sustainable rate before packet drops

Summary

#  Commit   Description                                Rate         Change  LLC Misses
1  5ead471  Baseline (main)                            11.6M pkt/s  -       3,510,945
2  82ac53b  iface: store types in static array         11.8M pkt/s  +1.7%   3,198,012
3  ff88d80  infra: accumulate interface stats          11.8M pkt/s  +0%     2,861,229
4  de9f849  infra: move RCU quiescent to housekeeping  11.8M pkt/s  +0%     2,674,026
5  c1f9b27  datapath: batch packet enqueue             13.0M pkt/s  +10.2%  2,910,675

Total improvement: 11.6M to 13.0M pkt/s (+12.1%)

Detailed Analysis

iface: store types in a static array

Change: Replaced linked list traversal with static array lookup for interface type resolution.

Before:

static STAILQ_HEAD(, iface_type) types = STAILQ_HEAD_INITIALIZER(types);

const struct iface_type *iface_type_get(gr_iface_type_t type_id) {
    struct iface_type *t;
    STAILQ_FOREACH (t, &types, next)
        if (t->id == type_id)
            return t;
    return errno_set_null(ENODEV);
}

After:

static const struct iface_type *iface_types[UINT_NUM_VALUES(gr_iface_type_t)];

const struct iface_type *iface_type_get(gr_iface_type_t type_id) {
    return iface_types[type_id];
}

Impact: LLC misses reduced by 9% (3.51M to 3.20M) and throughput increased from 11.6M to 11.8M pkt/s (+1.7%). The O(n) linked list traversal touched multiple cache lines for pointer chasing. The O(1) array lookup touches a single cache line. This benefits iface_get_eth_addr() called in the eth_input hot path. The eth_input node improved from 48.1 to 48.0 cycles/pkt, which was enough to cross the threshold for the next rate level.
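
Registration under the new scheme reduces to a single array store. A minimal sketch (the function name and the assert are assumptions, not the actual code; iface_types is the static array shown above):

#include <assert.h>

/* Hypothetical registration counterpart of the array lookup above. No bounds
 * check is needed: the array covers every value of gr_iface_type_t. */
void iface_type_register(const struct iface_type *t) {
    assert(iface_types[t->id] == NULL); /* each type ID registered once */
    iface_types[t->id] = t;
}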

infra: accumulate interface stats before flushing

Change: Instead of updating the shared iface_stats structure for every packet, accumulate counts in local variables and flush only when the interface changes or at batch end.

Impact: LLC misses reduced by 11% (3.20M to 2.86M). For homogeneous traffic where all packets in a batch belong to the same interface, the shared stats structure is accessed once instead of N times, reducing store buffer pressure.

Why no throughput increase: The zero-drop rate remains at 11.8M pkt/s because ip_input dominates the critical path at 88 cycles/pkt and is unaffected by this change. The eth_input node improved from 48.0 to 46.2 cycles/pkt (-4%), but this doesn't move the bottleneck. The cache miss reduction creates headroom for the later batch enqueue optimization.
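
A minimal sketch of the accumulation pattern described above (the accessor and flush helpers are illustrative assumptions, not the actual datapath code):

#include <stdint.h>
#include <rte_mbuf.h>

uint16_t mbuf_iface_id(const struct rte_mbuf *m);                    /* assumed accessor */
void stats_flush(uint16_t iface_id, uint64_t pkts, uint64_t bytes);  /* assumed helper: writes shared iface_stats */

/* Accumulate counters locally; touch the shared stats structure only when the
 * interface changes or at the end of the batch. */
static inline void count_batch(struct rte_mbuf **mbufs, uint16_t n) {
    uint16_t cur_iface = UINT16_MAX;
    uint64_t pkts = 0, bytes = 0;

    for (uint16_t i = 0; i < n; i++) {
        uint16_t iface_id = mbuf_iface_id(mbufs[i]);
        if (iface_id != cur_iface) {
            if (pkts != 0)
                stats_flush(cur_iface, pkts, bytes);
            cur_iface = iface_id;
            pkts = 0;
            bytes = 0;
        }
        pkts++;
        bytes += rte_pktmbuf_pkt_len(mbufs[i]);
    }
    if (pkts != 0)
        stats_flush(cur_iface, pkts, bytes);
}

For homogeneous traffic the shared structure is written once per batch, matching the O(number of unique interfaces) behavior described above.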

infra: move RCU quiescent state report to housekeeping interval

Change: Move rte_rcu_qsbr_quiescent() from every graph walk iteration into the housekeeping block that runs every 256 iterations.

Impact: LLC misses reduced by 7% (2.86M to 2.67M) and L1 misses reduced by 6% (2.14B to 2.01B). The memory barriers associated with RCU quiescent state reporting were being executed on every loop iteration. Moving them to the housekeeping interval reduces barrier frequency by 256x while still ensuring timely grace period completion.

The throughput remains at 11.8M pkt/s as ip_input (88 cycles/pkt) still dominates the critical path.
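
The loop restructuring can be pictured roughly as follows (a sketch assuming a simple iteration counter; variable names and the housekeeping layout are illustrative, not the actual main loop):

#include <stdint.h>
#include <rte_graph_worker.h>
#include <rte_rcu_qsbr.h>

#define HOUSEKEEPING_PERIOD 256 /* matches the interval cited above */

static void datapath_loop(struct rte_graph *graph, struct rte_rcu_qsbr *rcu, unsigned int thread_id) {
    for (uint32_t iter = 0; /* run until stopped */; iter++) {
        rte_graph_walk(graph);
        if (iter % HOUSEKEEPING_PERIOD == 0) {
            /* Previously reported on every iteration; now amortized here. */
            rte_rcu_qsbr_quiescent(rcu, thread_id);
            /* ...other periodic housekeeping... */
        }
    }
}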

datapath: batch packet enqueue to reduce cache misses

Change: Replaced individual rte_node_enqueue_x1() calls with a batching scheme using helper macros:

#define NODE_ENQUEUE_NEXT(graph, node, objs, i, edge)                              \
    if (last_edge == RTE_EDGE_ID_INVALID) {                                        \
        last_edge = edge;                                                          \
    } else if (edge != last_edge) {                                                \
        rte_node_enqueue(graph, node, last_edge, &objs[run_start], i - run_start); \
        run_start = i;                                                             \
        last_edge = edge;                                                          \
    }

#define NODE_ENQUEUE_FLUSH(graph, node, objs, count)                                   \
    if (run_start == 0 && count != 0) {                                                \
        rte_node_next_stream_move(graph, node, last_edge);                             \
    } else if (run_start < count) {                                                    \
        rte_node_enqueue(graph, node, last_edge, &objs[run_start], count - run_start); \
    }

Impact: Throughput increased by 10.2% (11.8M to 13.0M pkt/s). This is the most significant optimization.

The per-packet rte_node_enqueue_x1() accessed the destination node's rte_node structure (specifically idx and objs fields) once per packet. Even when all packets go to the same edge, touching these fields N times instead of once causes cache line bouncing.

The new scheme:

  • Tracks runs of consecutive packets going to the same edge
  • When edge changes, flushes previous run using rte_node_enqueue() (bulk copy)
  • When all packets go to same edge (common case), uses rte_node_next_stream_move() which swaps pointers and updates idx once

For homogeneous traffic (typical case), the destination node structure is touched once per batch instead of once per packet.
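
For illustration, a node process callback using these macros might look like this (pick_edge() is a hypothetical per-packet classification; the real nodes compute the edge from their own lookup results):

#include <rte_graph_worker.h>

rte_edge_t pick_edge(void *pkt); /* hypothetical classification */

static uint16_t example_process(struct rte_graph *graph, struct rte_node *node,
                                void **objs, uint16_t nb_objs) {
    rte_edge_t last_edge = RTE_EDGE_ID_INVALID; /* locals expected by the macros */
    uint16_t run_start = 0;

    for (uint16_t i = 0; i < nb_objs; i++) {
        rte_edge_t edge = pick_edge(objs[i]);
        NODE_ENQUEUE_NEXT(graph, node, objs, i, edge);
    }
    NODE_ENQUEUE_FLUSH(graph, node, objs, nb_objs);
    return nb_objs;
}

When every packet picks the same edge, NODE_ENQUEUE_NEXT never enqueues and run_start stays at 0, so NODE_ENQUEUE_FLUSH touches the destination node only once via rte_node_next_stream_move().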

Perf Annotate Observations

Baseline port_output_process

Cache misses concentrated at destination node structure accesses:

16.50%: testw   %bx, %bx           ; Check node->idx
 9.21%: movq    %r12, (%rax,%rdx,8); Write to node->objs[idx]
 8.69%: movzwl  0x150(%r15), %edx  ; Read node->idx

Optimized port_output_process

Cache miss distribution shifted to edge detection and bulk operations:

22.66%: jne     0x4bf367           ; Edge comparison branch
17.32%: movzwl  0x46(%r15), %eax   ; Read port queue mapping
15.68%: movq    0x8(%rsp), %rax    ; Loop control

The per-packet node structure access pattern is eliminated.

Cycles per Packet Comparison

Node statistics show reduced cycles/pkt for output nodes:

Node          Baseline (cycles/pkt)  Final (cycles/pkt)  Reduction
port_output   18.9                   11.9                37%
iface_output  24.2                   17.0                30%
ip_output     44.4                   38.4                14%
eth_output    45.5                   37.3                18%
ip_forward    19.9                   15.5                22%

Conclusion

The patch series achieves a 12.1% throughput improvement. The optimizations with clear measurable impact are:

  1. Static array for interface types (+1.7% throughput, -9% LLC misses): Eliminates pointer chasing in hot path lookups.

  2. Interface stats batching (-11% LLC misses): Reduces per-packet writes to shared memory, though throughput was bottleneck-limited elsewhere.

  3. RCU quiescent movement (-7% LLC misses, -6% L1 misses): Reduces memory barrier frequency by moving RCU reporting to housekeeping interval.

  4. Batch packet enqueue (+10.2% throughput, -37% cycles in port_output): The largest improvement, transforming O(n) node structure accesses into O(1) for homogeneous traffic.

Replace hardcoded array sizes and *_COUNT enum values with
UINT_NUM_VALUES() macro for all edge registration arrays. This makes
the array sizing explicit based on the index type and eliminates
tautological comparison warnings.

Update registration functions to use name-based validation instead of
bounds checking where applicable. Use gr_iface_type_name(),
gr_iface_mode_name(), and gr_nh_type_name() for better log messages
and consistent error reporting.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
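
For reference, one plausible shape of such a macro (an assumption for illustration only; the in-tree definition may differ):

#include <limits.h>

/* Number of distinct values representable by an unsigned integer index type,
 * used to size arrays indexed by that type. */
#define UINT_NUM_VALUES(t) (1ULL << (sizeof(t) * CHAR_BIT))
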
The interface type name is already available via gr_iface_type_name()
which derives it from the type ID. Storing it again in struct iface_type
is redundant. This field is unused anyway. Remove the name field from
iface_type.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace the linked list of interface types with a static array indexed
by type ID. This makes iface_type_get() an O(1) array lookup instead of
O(n) linked list traversal.

This benefits iface_get_eth_addr() which is called in the eth_input hot
path and needs to look up the interface type to find the get_eth_addr
callback.

Since interface types are no longer linked, the STAILQ_ENTRY field is
removed and the iface_type structs can be declared const.

Substituting pointer chasing with direct array indexing reduces LLC
cache misses by 9% and increases throughput from 11.6M to 11.8M pkt/s
(+1.7%).

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Instead of updating the shared iface_stats structure for every packet,
accumulate packets and bytes in local variables and flush to the stats
structure only when the interface changes or at the end of the batch.

This reduces memory writes from O(n) to O(number of unique interfaces)
per batch, which is typically O(1) for homogeneous traffic. The shared
stats structure is accessed less frequently, reducing cache line
pressure and store buffer usage.

Reduces LLC cache misses by 11% (3.20M to 2.86M) and eth_input cycles
from 48.0 to 46.2 per packet (-4%). Throughput remains at 11.8M pkt/s
as ip_input (88 cycles/pkt) dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
RCU quiescent state only needs to be reported periodically to signal
that the thread has passed through a quiescent period. Reporting it on
every graph walk iteration is unnecessarily frequent and adds overhead
from the memory barriers involved.

Move rte_rcu_qsbr_quiescent() from the main loop body into the
housekeeping block that runs every 256 iterations. This reduces the
frequency of memory barriers while still ensuring timely RCU grace
period completion.

Reduces LLC cache misses by 7% (2.86M to 2.67M) and L1 cache misses by
6% (2.14B to 2.01B). Throughput remains at 11.8M pkt/s as ip_input
dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace individual rte_node_enqueue_x1() calls with a batching scheme.
Each rte_node_enqueue_x1() call accesses the destination node's rte_node
structure to append one packet. Even when all packets go to the same
edge, touching these fields once per packet instead of once per batch
causes unnecessary cache pressure.

Introduce two helper macros:

 - NODE_ENQUEUE_NEXT: tracks runs of consecutive packets going to the
   same edge. When the edge changes, flush the previous run using
   rte_node_enqueue() which copies mbuf pointers in bulk.

 - NODE_ENQUEUE_FLUSH: at end of loop, if all packets went to the same
   edge (common case), use rte_node_next_stream_move() which swaps
   pointers and updates idx once. Otherwise flush the final run.

These macros are not used in port_rx (which uses stream move directly),
control_input (a source node), or control_output and port_tx (where
packets are consumed directly).

Increases throughput by 10.2% (11.8M to 13.0M pkt/s). Cycles per packet
reduced significantly in output nodes: port_output -37%, iface_output
-30%, ip_forward -22%, eth_output -18%, ip_output -14%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
The rx_process function was checking the ether_type of every received
packet to detect LACP slow protocol frames (0x8809). This check is only
relevant for bonded interfaces, yet it was performed unconditionally on
all ports.

Move the ether_type check outside the main loop and only execute it when
the interface mode is GR_IFACE_MODE_BOND. For non-bonded ports, the loop
now simply stores the iface pointer without accessing packet data.

This reduces L1 cache misses by 6% (2.26B to 2.13B) and port_rx node
cycles per packet by 10% (47.0 to 42.1). The perf annotate data shows
that the ether_type comparison accounted for 59% of L1 cache misses
within rx_process. Eliminating this access for non-bonded ports cuts
rx_process cache miss samples by 61%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
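
A rough sketch of the restructured classification (the struct fields and helpers here are assumptions for illustration, not the actual rx_process code):

#include <stdbool.h>
#include <stdint.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

struct iface; /* project type, opaque here */
void divert_lacp(struct rte_mbuf *m);                                /* assumed helper */
void set_mbuf_iface(struct rte_mbuf *m, const struct iface *iface);  /* assumed helper */

/* is_bond stands in for the iface mode == GR_IFACE_MODE_BOND check. */
static void classify_rx(const struct iface *iface, bool is_bond,
                        struct rte_mbuf **mbufs, uint16_t n) {
    if (is_bond) {
        /* Bond members: inspect ether_type to divert LACP slow protocol frames. */
        for (uint16_t i = 0; i < n; i++) {
            const struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(mbufs[i], const struct rte_ether_hdr *);
            if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_SLOW))
                divert_lacp(mbufs[i]);
            else
                set_mbuf_iface(mbufs[i], iface);
        }
    } else {
        /* Non-bond ports: store the iface pointer without touching packet data. */
        for (uint16_t i = 0; i < n; i++)
            set_mbuf_iface(mbufs[i], iface);
    }
}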