
datapath: cache optimization for 12% throughput improvement #495

Open

rjarry wants to merge 7 commits into DPDK:main from rjarry:optimizations

Conversation

rjarry (Collaborator) commented Feb 3, 2026

This series reduces L1 and LLC cache misses in the datapath hot paths, achieving a 12% throughput improvement (11.6M to 13.0M pkt/s) in bi-directional forwarding benchmarks.

The first two commits are minor cleanups that prepare for the optimizations. The third commit replaces the linked list of interface types with a static array, eliminating pointer chasing in iface_type_get(), which is called from the eth_input hot path. This alone reduces LLC cache misses by 9% and increases throughput by 1.7%.

The fourth commit batches interface statistics updates. Instead of writing to the shared iface_stats structure for every packet, counters are accumulated locally and flushed only when the interface changes or at batch end. This reduces LLC cache misses by another 11%.

The fifth commit moves RCU quiescent state reporting from every graph walk iteration to the housekeeping interval that runs every 256 iterations. This reduces memory barrier frequency and cuts cache misses by 7%.

The final and most impactful commit replaces per-packet rte_node_enqueue_x1() calls with a batching scheme. For homogeneous traffic where all packets go to the same edge, the destination node structure is now touched once per batch instead of once per packet. This increases throughput by 10.2% and reduces cycles per packet in output nodes by up to 37%.

Detailed analysis with perf annotate observations and per-node cycle counts

Optimization Benchmark Analysis

This document presents the performance analysis of the patch series on the
optimizations branch targeting L1 and LLC cache miss reduction.

Test Environment

Each benchmark was run with:

  • perf record -g -e L1-dcache-load-misses,LLC-load-misses --call-graph dwarf -C 10 -- sleep 10
  • Bi-directional traffic at maximum sustainable rate before packet drops

Summary

#  Commit   Description                                Rate         Change  LLC Misses
1  5ead471  Baseline (main)                            11.6M pkt/s  -       3,510,945
2  82ac53b  iface: store types in static array         11.8M pkt/s  +1.7%   3,198,012
3  ff88d80  infra: accumulate interface stats          11.8M pkt/s  +0%     2,861,229
4  de9f849  infra: move RCU quiescent to housekeeping  11.8M pkt/s  +0%     2,674,026
5  c1f9b27  datapath: batch packet enqueue             13.0M pkt/s  +10.2%  2,910,675

Total improvement: 11.6M to 13.0M pkt/s (+12.1%)

Detailed Analysis

iface: store types in a static array

Change: Replaced linked list traversal with static array lookup for interface type resolution.

Before:

static STAILQ_HEAD(, iface_type) types = STAILQ_HEAD_INITIALIZER(types);

const struct iface_type *iface_type_get(gr_iface_type_t type_id) {
    struct iface_type *t;
    STAILQ_FOREACH (t, &types, next)
        if (t->id == type_id)
            return t;
    return errno_set_null(ENODEV);
}

After:

static const struct iface_type *iface_types[UINT_NUM_VALUES(gr_iface_type_t)];

const struct iface_type *iface_type_get(gr_iface_type_t type_id) {
    return iface_types[type_id];
}

Impact: LLC misses reduced by 9% (3.51M to 3.20M) and throughput increased from 11.6M to 11.8M pkt/s (+1.7%). The O(n) linked list traversal touched multiple cache lines for pointer chasing. The O(1) array lookup touches a single cache line. This benefits iface_get_eth_addr() called in the eth_input hot path. The eth_input node improved from 48.1 to 48.0 cycles/pkt, which was enough to cross the threshold for the next rate level.
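
Registration under the new scheme reduces to a single array store. A minimal sketch (the function name and the assert are assumptions, not the actual code; iface_types is the static array shown above):

#include <assert.h>

/* Hypothetical registration counterpart of the array lookup above. No bounds
 * check is needed: the array covers every value of gr_iface_type_t. */
void iface_type_register(const struct iface_type *t) {
    assert(iface_types[t->id] == NULL); /* each type ID registered once */
    iface_types[t->id] = t;
}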

infra: accumulate interface stats before flushing

Change: Instead of updating the shared iface_stats structure for every packet, accumulate counts in local variables and flush only when the interface changes or at batch end.

Impact: LLC misses reduced by 11% (3.20M to 2.86M). For homogeneous traffic where all packets in a batch belong to the same interface, the shared stats structure is accessed once instead of N times, reducing store buffer pressure.

Why no throughput increase: The zero-drop rate remains at 11.8M pkt/s because ip_input dominates the critical path at 88 cycles/pkt and is unaffected by this change. The eth_input node improved from 48.0 to 46.2 cycles/pkt (-4%), but this doesn't move the bottleneck. The cache miss reduction creates headroom for the later batch enqueue optimization.
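
A minimal sketch of the accumulation pattern described above (the accessor and flush helpers are illustrative assumptions, not the actual datapath code):

#include <stdint.h>
#include <rte_mbuf.h>

uint16_t mbuf_iface_id(const struct rte_mbuf *m);                    /* assumed accessor */
void stats_flush(uint16_t iface_id, uint64_t pkts, uint64_t bytes);  /* assumed helper: writes shared iface_stats */

/* Accumulate counters locally; touch the shared stats structure only when the
 * interface changes or at the end of the batch. */
static inline void count_batch(struct rte_mbuf **mbufs, uint16_t n) {
    uint16_t cur_iface = UINT16_MAX;
    uint64_t pkts = 0, bytes = 0;

    for (uint16_t i = 0; i < n; i++) {
        uint16_t iface_id = mbuf_iface_id(mbufs[i]);
        if (iface_id != cur_iface) {
            if (pkts != 0)
                stats_flush(cur_iface, pkts, bytes);
            cur_iface = iface_id;
            pkts = 0;
            bytes = 0;
        }
        pkts++;
        bytes += rte_pktmbuf_pkt_len(mbufs[i]);
    }
    if (pkts != 0)
        stats_flush(cur_iface, pkts, bytes);
}

For homogeneous traffic the shared structure is written once per batch, matching the O(number of unique interfaces) behavior described above.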

infra: move RCU quiescent state report to housekeeping interval

Change: Move rte_rcu_qsbr_quiescent() from every graph walk iteration into the housekeeping block that runs every 256 iterations.

Impact: LLC misses reduced by 7% (2.86M to 2.67M) and L1 misses reduced by 6% (2.14B to 2.01B). The memory barriers associated with RCU quiescent state reporting were being executed on every loop iteration. Moving them to the housekeeping interval reduces barrier frequency by 256x while still ensuring timely grace period completion.

The throughput remains at 11.8M pkt/s as ip_input (88 cycles/pkt) still dominates the critical path.
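
The loop restructuring can be pictured roughly as follows (a sketch assuming a simple iteration counter; variable names and the housekeeping layout are illustrative, not the actual main loop):

#include <stdint.h>
#include <rte_graph_worker.h>
#include <rte_rcu_qsbr.h>

#define HOUSEKEEPING_PERIOD 256 /* matches the interval cited above */

static void datapath_loop(struct rte_graph *graph, struct rte_rcu_qsbr *rcu, unsigned int thread_id) {
    for (uint32_t iter = 0; /* run until stopped */; iter++) {
        rte_graph_walk(graph);
        if (iter % HOUSEKEEPING_PERIOD == 0) {
            /* Previously reported on every iteration; now amortized here. */
            rte_rcu_qsbr_quiescent(rcu, thread_id);
            /* ...other periodic housekeeping... */
        }
    }
}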

datapath: batch packet enqueue to reduce cache misses

Change: Replaced individual rte_node_enqueue_x1() calls with a batching scheme using helper macros:

#define NODE_ENQUEUE_NEXT(graph, node, objs, i, edge)                              \
    if (last_edge == RTE_EDGE_ID_INVALID) {                                        \
        last_edge = edge;                                                          \
    } else if (edge != last_edge) {                                                \
        rte_node_enqueue(graph, node, last_edge, &objs[run_start], i - run_start); \
        run_start = i;                                                             \
        last_edge = edge;                                                          \
    }

#define NODE_ENQUEUE_FLUSH(graph, node, objs, count)                                   \
    if (run_start == 0 && count != 0) {                                                \
        rte_node_next_stream_move(graph, node, last_edge);                             \
    } else if (run_start < count) {                                                    \
        rte_node_enqueue(graph, node, last_edge, &objs[run_start], count - run_start); \
    }

Impact: Throughput increased by 10.2% (11.8M to 13.0M pkt/s). This is the most significant optimization.

The per-packet rte_node_enqueue_x1() accessed the destination node's rte_node structure (specifically idx and objs fields) once per packet. Even when all packets go to the same edge, touching these fields N times instead of once causes cache line bouncing.

The new scheme:

  • Tracks runs of consecutive packets going to the same edge
  • When edge changes, flushes previous run using rte_node_enqueue() (bulk copy)
  • When all packets go to same edge (common case), uses rte_node_next_stream_move() which swaps pointers and updates idx once

For homogeneous traffic (typical case), the destination node structure is touched once per batch instead of once per packet.
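
For illustration, a node process callback using these macros might look like this (pick_edge() is a hypothetical per-packet classification; the real nodes compute the edge from their own lookup results):

#include <rte_graph_worker.h>

rte_edge_t pick_edge(void *pkt); /* hypothetical classification */

static uint16_t example_process(struct rte_graph *graph, struct rte_node *node,
                                void **objs, uint16_t nb_objs) {
    rte_edge_t last_edge = RTE_EDGE_ID_INVALID; /* locals expected by the macros */
    uint16_t run_start = 0;

    for (uint16_t i = 0; i < nb_objs; i++) {
        rte_edge_t edge = pick_edge(objs[i]);
        NODE_ENQUEUE_NEXT(graph, node, objs, i, edge);
    }
    NODE_ENQUEUE_FLUSH(graph, node, objs, nb_objs);
    return nb_objs;
}

When every packet picks the same edge, NODE_ENQUEUE_NEXT never enqueues and run_start stays at 0, so NODE_ENQUEUE_FLUSH touches the destination node only once via rte_node_next_stream_move().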

Perf Annotate Observations

Baseline port_output_process

Cache misses concentrated at destination node structure accesses:

16.50%: testw   %bx, %bx           ; Check node->idx
 9.21%: movq    %r12, (%rax,%rdx,8); Write to node->objs[idx]
 8.69%: movzwl  0x150(%r15), %edx  ; Read node->idx

Optimized port_output_process

Cache miss distribution shifted to edge detection and bulk operations:

22.66%: jne     0x4bf367           ; Edge comparison branch
17.32%: movzwl  0x46(%r15), %eax   ; Read port queue mapping
15.68%: movq    0x8(%rsp), %rax    ; Loop control

The per-packet node structure access pattern is eliminated.

Cycles per Packet Comparison

Node statistics show reduced cycles/pkt for output nodes:

Node          Baseline (cycles/pkt)  Final (cycles/pkt)  Reduction
port_output   18.9                   11.9                37%
iface_output  24.2                   17.0                30%
ip_output     44.4                   38.4                14%
eth_output    45.5                   37.3                18%
ip_forward    19.9                   15.5                22%

Conclusion

The patch series achieves a 12.1% throughput improvement. The optimizations with clear measurable impact are:

  1. Static array for interface types (+1.7% throughput, -9% LLC misses): Eliminates pointer chasing in hot path lookups.

  2. Interface stats batching (-11% LLC misses): Reduces per-packet writes to shared memory, though throughput was bottleneck-limited elsewhere.

  3. RCU quiescent movement (-7% LLC misses, -6% L1 misses): Reduces memory barrier frequency by moving RCU reporting to housekeeping interval.

  4. Batch packet enqueue (+10.2% throughput, -37% cycles in port_output): The largest improvement, transforming O(n) node structure accesses into O(1) for homogeneous traffic.

Replace hardcoded array sizes and *_COUNT enum values with
UINT_NUM_VALUES() macro for all edge registration arrays. This makes
the array sizing explicit based on the index type and eliminates
tautological comparison warnings.

Update registration functions to use name-based validation instead of
bounds checking where applicable. Use gr_iface_type_name(),
gr_iface_mode_name(), and gr_nh_type_name() for better log messages
and consistent error reporting.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
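
For reference, one plausible shape of such a macro (an assumption for illustration only; the in-tree definition may differ):

#include <limits.h>

/* Number of distinct values representable by an unsigned integer index type,
 * used to size arrays indexed by that type. */
#define UINT_NUM_VALUES(t) (1ULL << (sizeof(t) * CHAR_BIT))
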
The interface type name is already available via gr_iface_type_name()
which derives it from the type ID. Storing it again in struct iface_type
is redundant. This field is unused anyway. Remove the name field from
iface_type.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace the linked list of interface types with a static array indexed
by type ID. This makes iface_type_get() an O(1) array lookup instead of
O(n) linked list traversal.

This benefits iface_get_eth_addr() which is called in the eth_input hot
path and needs to look up the interface type to find the get_eth_addr
callback.

Since interface types are no longer linked, the STAILQ_ENTRY field is
removed and the iface_type structs can be declared const.

Substituting pointer chasing with direct array indexing reduces LLC
cache misses by 9% and increases throughput from 11.6M to 11.8M pkt/s
(+1.7%).

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Instead of updating the shared iface_stats structure for every packet,
accumulate packets and bytes in local variables and flush to the stats
structure only when the interface changes or at the end of the batch.

This reduces memory writes from O(n) to O(number of unique interfaces)
per batch, which is typically O(1) for homogeneous traffic. The shared
stats structure is accessed less frequently, reducing cache line
pressure and store buffer usage.

Reduces LLC cache misses by 11% (3.20M to 2.86M) and eth_input cycles
from 48.0 to 46.2 per packet (-4%). Throughput remains at 11.8M pkt/s
as ip_input (88 cycles/pkt) dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
RCU quiescent state only needs to be reported periodically to signal
that the thread has passed through a quiescent period. Reporting it on
every graph walk iteration is unnecessarily frequent and adds overhead
from the memory barriers involved.

Move rte_rcu_qsbr_quiescent() from the main loop body into the
housekeeping block that runs every 256 iterations. This reduces the
frequency of memory barriers while still ensuring timely RCU grace
period completion.

Reduces LLC cache misses by 7% (2.86M to 2.67M) and L1 cache misses by
6% (2.14B to 2.01B). Throughput remains at 11.8M pkt/s as ip_input
dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace individual rte_node_enqueue_x1() calls with a batching scheme.
Each rte_node_enqueue_x1() call accesses the destination node's rte_node
structure to append one packet. Even when all packets go to the same
edge, touching these fields once per packet instead of once per batch
causes unnecessary cache pressure.

Introduce two helper macros:

 - NODE_ENQUEUE_NEXT: tracks runs of consecutive packets going to the
   same edge. When the edge changes, flush the previous run using
   rte_node_enqueue() which copies mbuf pointers in bulk.

 - NODE_ENQUEUE_FLUSH: at end of loop, if all packets went to the same
   edge (common case), use rte_node_next_stream_move() which swaps
   pointers and updates idx once. Otherwise flush the final run.

These macros are not used in port_rx (which uses stream move directly),
control_input (a source node), or control_output and port_tx (where
packets are consumed directly).

Increases throughput by 10.2% (11.8M to 13.0M pkt/s). Cycles per packet
reduced significantly in output nodes: port_output -37%, iface_output
-30%, ip_forward -22%, eth_output -18%, ip_output -14%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
The rx_process function was checking the ether_type of every received
packet to detect LACP slow protocol frames (0x8809). This check is only
relevant for bonded interfaces, yet it was performed unconditionally on
all ports.

Move the ether_type check outside the main loop and only execute it when
the interface mode is GR_IFACE_MODE_BOND. For non-bonded ports, the loop
now simply stores the iface pointer without accessing packet data.

This reduces L1 cache misses by 6% (2.26B to 2.13B) and port_rx node
cycles per packet by 10% (47.0 to 42.1). The perf annotate data shows
that the ether_type comparison accounted for 59% of L1 cache misses
within rx_process. Eliminating this access for non-bonded ports cuts
rx_process cache miss samples by 61%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
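
A rough sketch of the restructured classification (the struct fields and helpers here are assumptions for illustration, not the actual rx_process code):

#include <stdbool.h>
#include <stdint.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

struct iface; /* project type, opaque here */
void divert_lacp(struct rte_mbuf *m);                                /* assumed helper */
void set_mbuf_iface(struct rte_mbuf *m, const struct iface *iface);  /* assumed helper */

/* is_bond stands in for the iface mode == GR_IFACE_MODE_BOND check. */
static void classify_rx(const struct iface *iface, bool is_bond,
                        struct rte_mbuf **mbufs, uint16_t n) {
    if (is_bond) {
        /* Bond members: inspect ether_type to divert LACP slow protocol frames. */
        for (uint16_t i = 0; i < n; i++) {
            const struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(mbufs[i], const struct rte_ether_hdr *);
            if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_SLOW))
                divert_lacp(mbufs[i]);
            else
                set_mbuf_iface(mbufs[i], iface);
        }
    } else {
        /* Non-bond ports: store the iface pointer without touching packet data. */
        for (uint16_t i = 0; i < n; i++)
            set_mbuf_iface(mbufs[i], iface);
    }
}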