datapath: cache optimization for 12% throughput improvement #495
Open
Conversation
Replace hardcoded array sizes and *_COUNT enum values with the UINT_NUM_VALUES() macro for all edge registration arrays. This makes the array sizing explicit based on the index type and eliminates tautological comparison warnings. Update registration functions to use name-based validation instead of bounds checking where applicable. Use gr_iface_type_name(), gr_iface_mode_name(), and gr_nh_type_name() for better log messages and consistent error reporting. Signed-off-by: Robin Jarry <rjarry@redhat.com>
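As an illustration only, a macro of this shape could size an array from its index type; the actual UINT_NUM_VALUES() definition in the tree may differ:

```c
#include <stdint.h>

/* Hedged sketch: the real UINT_NUM_VALUES() may be defined differently.
 * The idea is that an array indexed by a small unsigned type gets exactly
 * one slot per representable index value (suitable for 8/16-bit types). */
#define UINT_NUM_VALUES(t) (1U << (sizeof(t) * 8))

/* Example: an edge registration array indexed by a uint8_t id has 256 slots,
 * so no separate *_COUNT enum value and no tautological bounds check. */
static const char *edge_names[UINT_NUM_VALUES(uint8_t)];
```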
The interface type name is already available via gr_iface_type_name() which derives it from the type ID. Storing it again in struct iface_type is redundant. This field is unused anyway. Remove the name field from iface_type. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace the linked list of interface types with a static array indexed by type ID. This makes iface_type_get() an O(1) array lookup instead of an O(n) linked list traversal. This benefits iface_get_eth_addr(), which is called in the eth_input hot path and needs to look up the interface type to find the get_eth_addr callback. Since interface types are no longer linked, the STAILQ_ENTRY field is removed and the iface_type structs can be declared const. Replacing pointer chasing with direct array indexing reduces LLC cache misses by 9% and increases throughput from 11.6M to 11.8M pkt/s (+1.7%). Signed-off-by: Robin Jarry <rjarry@redhat.com>
Instead of updating the shared iface_stats structure for every packet, accumulate packets and bytes in local variables and flush to the stats structure only when the interface changes or at the end of the batch. This reduces memory writes from O(n) to O(number of unique interfaces) per batch, which is typically O(1) for homogeneous traffic. The shared stats structure is accessed less frequently, reducing cache line pressure and store buffer usage. Reduces LLC cache misses by 11% (3.20M to 2.86M) and eth_input cycles from 48.0 to 46.2 per packet (-4%). Throughput remains at 11.8M pkt/s as ip_input (88 cycles/pkt) dominates the critical path. Signed-off-by: Robin Jarry <rjarry@redhat.com>
RCU quiescent state only needs to be reported periodically to signal that the thread has passed through a quiescent period. Reporting it on every graph walk iteration is unnecessarily frequent and adds overhead from the memory barriers involved. Move rte_rcu_qsbr_quiescent() from the main loop body into the housekeeping block that runs every 256 iterations. This reduces the frequency of memory barriers while still ensuring timely RCU grace period completion. Reduces LLC cache misses by 7% (2.86M to 2.67M) and L1 cache misses by 6% (2.14B to 2.01B). Throughput remains at 11.8M pkt/s as ip_input dominates the critical path. Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace individual rte_node_enqueue_x1() calls with a batching scheme. Each rte_node_enqueue_x1() call accesses the destination node's rte_node structure to append one packet. Even when all packets go to the same edge, touching these fields once per packet instead of once per batch causes unnecessary cache pressure.

Introduce two helper macros:

- NODE_ENQUEUE_NEXT: tracks runs of consecutive packets going to the same edge. When the edge changes, flush the previous run using rte_node_enqueue() which copies mbuf pointers in bulk.
- NODE_ENQUEUE_FLUSH: at end of loop, if all packets went to the same edge (common case), use rte_node_next_stream_move() which swaps pointers and updates idx once. Otherwise flush the final run.

These macros are not used in port_rx (uses stream move directly), control_input (source node), control_output and port_tx (packets consumed directly).

Increases throughput by 10.2% (11.8M to 13.0M pkt/s). Cycles per packet reduced significantly in output nodes: port_output -37%, iface_output -30%, ip_forward -22%, eth_output -18%, ip_output -14%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
The rx_process function was checking the ether_type of every received packet to detect LACP slow protocol frames (0x8809). This check is only relevant for bonded interfaces, yet it was performed unconditionally on all ports. Move the ether_type check outside the main loop and only execute it when the interface mode is GR_IFACE_MODE_BOND. For non-bonded ports, the loop now simply stores the iface pointer without accessing packet data. This reduces L1 cache misses by 6% (2.26B to 2.13B) and port_rx node cycles per packet by 10% (47.0 to 42.1). The perf annotate data shows that the ether_type comparison accounted for 59% of L1 cache misses within rx_process. Eliminating this access for non-bonded ports cuts rx_process cache miss samples by 61%. Signed-off-by: Robin Jarry <rjarry@redhat.com>
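A minimal sketch of the reorganized loop described in this commit; GR_IFACE_MODE_BOND and the slow-protocols ether_type come from the commit message, while the function signature, the `pkt_iface` output array, and the other names are simplified assumptions rather than the real grout code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

static void rx_process_sketch(bool is_bond, const void *iface,
			      struct rte_mbuf **mbufs, const void **pkt_iface,
			      uint16_t n) {
	if (is_bond) {
		/* Bonded interface: read the ether_type to divert LACP
		 * (slow protocol, 0x8809) frames for separate handling. */
		for (uint16_t i = 0; i < n; i++) {
			const struct rte_ether_hdr *eth = rte_pktmbuf_mtod(
				mbufs[i], const struct rte_ether_hdr *);
			if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_SLOW))
				pkt_iface[i] = NULL; /* LACP frame */
			else
				pkt_iface[i] = iface;
		}
	} else {
		/* Regular port: store the iface pointer without touching
		 * packet data, avoiding the per-packet L1 miss. */
		for (uint16_t i = 0; i < n; i++)
			pkt_iface[i] = iface;
	}
}
```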
This series reduces L1 and LLC cache misses in the datapath hot paths, achieving a 12% throughput improvement (11.6M to 13.0M pkt/s) in bi-directional forwarding benchmarks.
The first two commits are minor cleanups that prepare for the optimizations. The third commit replaces the linked list of interface types with a static array, eliminating pointer chasing in iface_type_get() which is called from the eth_input hot path. This alone reduces LLC cache misses by 9% and increases throughput by 1.7%.
The fourth commit batches interface statistics updates. Instead of writing to the shared iface_stats structure for every packet, counters are accumulated locally and flushed only when the interface changes or at batch end. This reduces LLC cache misses by another 11%.
The fifth commit moves RCU quiescent state reporting from every graph walk iteration to the housekeeping interval that runs every 256 iterations. This reduces memory barrier frequency and cuts cache misses by 7%.
The final and most impactful commit replaces per-packet rte_node_enqueue_x1() calls with a batching scheme. For homogeneous traffic where all packets go to the same edge, the destination node structure is now touched once per batch instead of once per packet. This increases throughput by 10.2% and reduces cycles per packet in output nodes by up to 37%.
Detailed analysis with perf annotate observations and per-node cycle counts
# Optimization Benchmark Analysis

This document presents the performance analysis of the patch series on the `optimizations` branch targeting L1 and LLC cache miss reduction.

## Test Environment
Each benchmark was run with:
`perf record -g -e L1-dcache-load-misses,LLC-load-misses --call-graph dwarf -C 10 -- sleep 10`

## Summary
Total improvement: 11.6M to 13.0M pkt/s (+12.1%)
## Detailed Analysis
### iface: store types in a static array
Change: Replaced linked list traversal with static array lookup for interface type resolution.
Before:
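(The original snippet was not captured in this page; the following is a hedged reconstruction of the general shape, with field and list names assumed rather than taken from the real code.)

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

struct iface_type {
	uint16_t id;
	STAILQ_ENTRY(iface_type) next;
	/* ... callbacks such as get_eth_addr ... */
};

static STAILQ_HEAD(, iface_type) iface_types = STAILQ_HEAD_INITIALIZER(iface_types);

struct iface_type *iface_type_get(uint16_t type_id) {
	struct iface_type *t;
	/* O(n): each hop is a dependent pointer load, usually landing on a
	 * different cache line. */
	STAILQ_FOREACH (t, &iface_types, next)
		if (t->id == type_id)
			return t;
	return NULL;
}
```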
After:
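(Again a hedged reconstruction; the removal of the STAILQ_ENTRY field and the const qualification follow the commit message, the rest is assumed.)

```c
#include <stddef.h>
#include <stdint.h>

struct iface_type; /* as above, but without the STAILQ_ENTRY field, so it can be const */

/* One slot per possible type id, filled at registration time. For this
 * illustration the id is assumed to fit in 8 bits; the series sizes the
 * array from the index type with UINT_NUM_VALUES(). */
static const struct iface_type *iface_types[UINT8_MAX + 1];

const struct iface_type *iface_type_get(uint8_t type_id) {
	/* O(1): a single indexed load touching one cache line. */
	return iface_types[type_id];
}
```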
Impact: LLC misses reduced by 9% (3.51M to 3.20M) and throughput increased from 11.6M to 11.8M pkt/s (+1.7%). The O(n) linked list traversal touched multiple cache lines for pointer chasing; the O(1) array lookup touches a single cache line. This benefits `iface_get_eth_addr()`, called in the `eth_input` hot path. The `eth_input` node improved from 48.1 to 48.0 cycles/pkt, which was enough to cross the threshold for the next rate level.

### infra: accumulate interface stats before flushing
Change: Instead of updating the shared `iface_stats` structure for every packet, accumulate counts in local variables and flush only when the interface changes or at batch end.

Impact: LLC misses reduced by 11% (3.20M to 2.86M). For homogeneous traffic where all packets in a batch belong to the same interface, the shared stats structure is accessed once instead of N times, reducing store buffer pressure.
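A minimal sketch of the accumulate-then-flush pattern, assuming simplified stats fields and a per-packet stats pointer; the real grout structures differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the real per-interface stats structure. */
struct iface_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* pkt_stats[i] points at the stats of the interface packet i belongs to,
 * pkt_len[i] is its length. Shared stats are written once per run of packets
 * from the same interface instead of once per packet. */
static void account_batch(struct iface_stats *const *pkt_stats,
			  const uint32_t *pkt_len, uint16_t n) {
	struct iface_stats *cur = NULL;
	uint64_t packets = 0, bytes = 0;

	for (uint16_t i = 0; i < n; i++) {
		if (pkt_stats[i] != cur) {
			if (cur != NULL) { /* flush the previous run */
				cur->packets += packets;
				cur->bytes += bytes;
			}
			cur = pkt_stats[i];
			packets = 0;
			bytes = 0;
		}
		packets++;
		bytes += pkt_len[i];
	}
	if (cur != NULL) { /* final flush at end of batch */
		cur->packets += packets;
		cur->bytes += bytes;
	}
}
```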
Why no throughput increase: The zero-drop rate remains at 11.8M pkt/s because `ip_input` dominates the critical path at 88 cycles/pkt and is unaffected by this change. The `eth_input` node improved from 48.0 to 46.2 cycles/pkt (-4%), but this doesn't move the bottleneck. The cache miss reduction creates headroom for the later batch enqueue optimization.

### infra: move RCU quiescent state report to housekeeping interval
Change: Move `rte_rcu_qsbr_quiescent()` from every graph walk iteration into the housekeeping block that runs every 256 iterations.

Impact: LLC misses reduced by 7% (2.86M to 2.67M) and L1 misses reduced by 6% (2.14B to 2.01B). The memory barriers associated with RCU quiescent state reporting were being executed on every loop iteration. Moving them to the housekeeping interval reduces barrier frequency by 256x while still ensuring timely grace period completion.
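A sketch of the loop restructuring; `rte_rcu_qsbr_quiescent()` and the 256-iteration interval come from the commit, the rest of the loop is simplified:

```c
#include <stdbool.h>
#include <rte_graph_worker.h>
#include <rte_rcu_qsbr.h>

#define HOUSEKEEPING_INTERVAL 256 /* from the commit message */

static void graph_loop_sketch(struct rte_graph *graph, struct rte_rcu_qsbr *rcu,
			      unsigned int thread_id, const volatile bool *stop) {
	for (unsigned int iter = 0; !*stop; iter++) {
		rte_graph_walk(graph);

		if (iter % HOUSEKEEPING_INTERVAL == 0) {
			/* Report quiescent state only every 256 iterations,
			 * so its memory barriers are no longer per-iteration. */
			rte_rcu_qsbr_quiescent(rcu, thread_id);
			/* ... other periodic housekeeping ... */
		}
	}
}
```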
The throughput remains at 11.8M pkt/s as `ip_input` (88 cycles/pkt) still dominates the critical path.

### datapath: batch packet enqueue to reduce cache misses
Change: Replaced individual `rte_node_enqueue_x1()` calls with a batching scheme using the `NODE_ENQUEUE_NEXT` and `NODE_ENQUEUE_FLUSH` helper macros.

Impact: Throughput increased by 10.2% (11.8M to 13.0M pkt/s). This is the most significant optimization.
The per-packet `rte_node_enqueue_x1()` accessed the destination node's `rte_node` structure (specifically the `idx` and `objs` fields) once per packet. Even when all packets go to the same edge, touching these fields N times instead of once causes cache line bouncing.

The new scheme:

- tracks runs of consecutive packets going to the same edge and flushes each run with `rte_node_enqueue()` (bulk copy)
- if all packets went to the same edge, uses `rte_node_next_stream_move()` which swaps pointers and updates `idx` once

For homogeneous traffic (typical case), the destination node structure is touched once per batch instead of once per packet.
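As an illustration of the scheme (not the actual `NODE_ENQUEUE_NEXT`/`NODE_ENQUEUE_FLUSH` macro bodies), a node process callback could track runs like this; `classify()` stands in for whatever per-packet edge decision the node makes:

```c
#include <stdint.h>
#include <rte_graph_worker.h>

static rte_edge_t classify(void *obj); /* hypothetical per-packet edge decision */

static uint16_t process_sketch(struct rte_graph *graph, struct rte_node *node,
			       void **objs, uint16_t nb_objs) {
	rte_edge_t run_edge = 0;
	uint16_t run_start = 0;

	if (nb_objs == 0)
		return 0;

	for (uint16_t i = 0; i < nb_objs; i++) {
		rte_edge_t edge = classify(objs[i]);
		if (i == 0) {
			run_edge = edge;
		} else if (edge != run_edge) {
			/* Edge changed: flush the previous run in one bulk copy. */
			rte_node_enqueue(graph, node, run_edge,
					 &objs[run_start], i - run_start);
			run_edge = edge;
			run_start = i;
		}
	}
	if (run_start == 0)
		/* Every packet went to the same edge: swap the stream pointer
		 * and update idx once instead of copying mbuf pointers. */
		rte_node_next_stream_move(graph, node, run_edge);
	else
		rte_node_enqueue(graph, node, run_edge,
				 &objs[run_start], nb_objs - run_start);
	return nb_objs;
}
```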
## Perf Annotate Observations
### Baseline port_output_process
Cache misses concentrated at destination node structure accesses:
### Optimized port_output_process
Cache miss distribution shifted to edge detection and bulk operations:
The per-packet node structure access pattern is eliminated.
## Cycles per Packet Comparison
Node statistics show reduced cycles/pkt for the output nodes; the figures quoted in the commit message are: port_output -37%, iface_output -30%, ip_forward -22%, eth_output -18%, ip_output -14%.
## Conclusion
The patch series achieves a 12.1% throughput improvement. The optimizations with clear measurable impact are:
Static array for interface types (+1.7% throughput, -9% LLC misses): Eliminates pointer chasing in hot path lookups.
Interface stats batching (-11% LLC misses): Reduces per-packet writes to shared memory, though throughput was bottleneck-limited elsewhere.
RCU quiescent movement (-7% LLC misses, -6% L1 misses): Reduces memory barrier frequency by moving RCU reporting to housekeeping interval.
Batch packet enqueue (+10.2% throughput, -37% cycles in port_output): The largest improvement, transforming O(n) node structure accesses into O(1) for homogeneous traffic.