
Add L2 bridge interface support#491

Draft
rjarry wants to merge 17 commits into DPDK:main from rjarry:bridge

Conversation

@rjarry
Collaborator

@rjarry rjarry commented Jan 29, 2026

Implement L2 bridging functionality with MAC learning, flooding, and per-bridge configuration options.

This series introduces a new bridge interface type that can attach port and bond interfaces as members. Packets received on member interfaces are forwarded based on learned MAC addresses, with unknown unicast, broadcast and multicast traffic flooded to all other members.

The first commits prepare the infrastructure by moving the control queue to avoid circular dependencies and allowing nexthop creation from datapath workers. This enables the bridge input node to learn MAC addresses on the fly without blocking the control plane.

The L2 nexthop type stores (bridge_id, vlan_id, mac) tuples with automatic ageing based on per-bridge configurable timeout. Static entries can be created via CLI to override learned ones.
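The ageing behaviour described above can be sketched with a minimal, hypothetical entry type (the names `l2_fdb_entry` and `fdb_entry_expired` are illustrative, not grout's actual API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry modelling the (bridge_id, vlan_id, mac) tuple. */
struct l2_fdb_entry {
	uint16_t bridge_id;
	uint16_t vlan_id;
	uint8_t mac[6];
	uint64_t last_seen; /* seconds, refreshed on every matching frame */
	bool is_static;     /* static CLI entries never age out */
};

/* An entry is stale once the per-bridge ageing timeout has elapsed,
 * unless it was created statically via the CLI. */
static bool fdb_entry_expired(const struct l2_fdb_entry *e,
			      uint64_t now, uint64_t ageing_timeout) {
	if (e->is_static)
		return false;
	return now - e->last_seen > ageing_timeout;
}
```

A periodic control-plane timer would walk the table with such a predicate and drop expired entries.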

Bridge configuration supports flags to disable flooding (NO_FLOOD) and MAC learning (NO_LEARN), a custom MAC address, and up to 64 member interfaces. The FRR plugin exposes bridges to zebra as ZEBRA_IF_BRIDGE with the proper slave type set on member interfaces.

Summary by CodeRabbit

  • New Features

    • Full L2 bridge support: bridge interfaces, member management, ageing/MAC handling, L2 nexthops, dataplane input/flood processing, and CLI for bridge and L2 nexthops.
  • Bug Fixes

    • More robust control-plane event delivery: deferred notifications, explicit control-queue draining, and QSBR sync tied to current core for deterministic reclamation.
  • Tests

    • Added smoke test validating L2/L3 bridge setup, learning, forwarding and teardown.


@coderabbitai

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough


Adds a new L2 module (api, cli, control, datapath) implementing bridge interfaces and L2 nexthops: new enums GR_IFACE_TYPE_BRIDGE/GR_IFACE_MODE_BRIDGE, GR_NH_T_L2/GR_NH_ORIGIN_BRIDGE, and L2 API/header types. Introduces bridge control and L2 nexthop management, CLI commands, datapath nodes bridge_input and bridge_flood, and a smoke test. Adds gr_control_queue.h and control_queue_drain plus deferred event delivery from dataplane workers; several rte_rcu_qsbr_synchronize calls now use rte_lcore_id(). Build files and complexity metrics updated.






@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@main/control_queue.c`:
- Around line 89-92: The current code violates the single-consumer ring
contract: ctrlq_ring is created with RING_F_SC_DEQ but control_queue_poll
(consumer in control queue thread) and control_queue_drain can both dequeue
concurrently; fix by either (A) forcing all dequeues onto the control queue
thread — change control_queue_drain to enqueue the drain request (use the same
enqueue API/event mechanism and have control_queue_poll handle the actual
dequeue and processing) so only control_queue_poll performs ring_dequeue, or (B)
switch the ring to multi-consumer mode — drop RING_F_SC_DEQ when creating ctrlq_ring (multi-consumer dequeue is the default when no flag is passed)
and audit callers to ensure they safely call the ring API concurrently; update
references to control_queue_drain and control_queue_poll accordingly.

In `@main/event.c`:
- Around line 42-44: control_queue_push's failure path currently drops
notifications silently; update the error path in the block handling
control_queue_push(notify_subscribers, (void *)obj, ev_type) to record the
failure (e.g., increment a dedicated metric/error counter and emit a descriptive
log entry including notify_subscribers, ev_type and obj identifiers) so drops
are observable; optionally consider a bounded retry or backpressure mechanism
later, but at minimum add the metric increment and a descriptive log/syslog error
for visibility.

In `@modules/l2/datapath/bridge_flood.c`:
- Around line 114-122: The tracing loop dereferences br->id without checking br
for NULL; update the block guarded by gr_mbuf_is_traced(m) (the loop that
creates struct bridge_flood_trace via gr_mbuf_trace_add on clones[]) to first
test whether br is non-NULL (e.g., if (br) { t->bridge_id = br->id; } else {
t->bridge_id = 0; } or skip setting bridge_id) to avoid dereferencing a NULL
pointer returned by iface_from_id(); ensure the rest of the fields
(t->flood_count) and trace addition still happen only after this NULL-safe
handling.
- Line 92: The edge selection is using the source iface type (iface->type) but
should use the destination member type (member->type); update the assignment
that sets edges[flood_count] in the bridge_flood code so it checks member->type
against GR_IFACE_TYPE_VLAN and assigns ETH_IN or OUTPUT accordingly (keep the
special-case behavior for the bridge interface destination at the existing later
check). Locate the edges[flood_count] assignment in bridge_flood and replace the
reference to iface->type with member->type to ensure routing is determined by
destination interface type.

In `@modules/l2/datapath/bridge_input.c`:
- Around line 65-68: The code extracts vlan_id when RTE_MBUF_F_RX_VLAN_STRIPPED
is set but never persists it, so egress sends untagged frames; fix by storing
the extracted vlan_id back into the mbuf metadata (e.g., assign m->vlan_tci =
vlan_id) and mark the mbuf for re-tagging (e.g., set a re-tag flag such as
RTE_MBUF_F_TX_VLAN or a module-specific mbuf flag) in the branch that checks
RTE_MBUF_F_RX_VLAN_STRIPPED so the egress path (which will read m->vlan_tci) can
re-insert the VLAN header; ensure you still clear or adjust
RTE_MBUF_F_RX_VLAN_STRIPPED consistently if needed.

Comment on lines +89 to 92
void control_queue_drain(uint32_t event, const void *obj) {
struct control_queue_drain drain = {event, obj};
control_queue_poll(0, 0, &drain);
}

⚠️ Potential issue | 🔴 Critical



Concurrent dequeue violates SC ring contract

ctrlq_ring is configured with RING_F_SC_DEQ, but two code paths can dequeue concurrently: the libevent event loop (via control_queue_poll callback in the dedicated control queue thread) and control_queue_drain() called directly from module code (e.g., nexthop deletion, interface removal). This is undefined behavior. Either ensure both paths run on the same consumer thread with proper synchronization, or switch to a multi-consumer ring mode.
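Option (A) — keeping all dequeues on the single consumer thread — can be illustrated with a toy queue (the queue and helper names below are stand-ins, not the actual ctrlq_ring API):

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal single-consumer queue model: only consumer_drain() dequeues. */
#define QCAP 64
static int queue[QCAP];
static size_t q_head, q_tail;
static int processed;

static int q_push(int v) {
	if ((q_tail + 1) % QCAP == q_head)
		return -1; /* full: caller must handle the drop */
	queue[q_tail] = v;
	q_tail = (q_tail + 1) % QCAP;
	return 0;
}

static void count_event(int v) { (void)v; processed++; }

/* Producers never dequeue. A "drain" is simply serviced by the single
 * consumer thread when it runs, preserving the SC dequeue contract. */
static size_t consumer_drain(void (*process)(int)) {
	size_t n = 0;
	while (q_head != q_tail) {
		process(queue[q_head]);
		q_head = (q_head + 1) % QCAP;
		n++;
	}
	return n;
}
```

With this shape, control_queue_drain would enqueue a drain request instead of calling the dequeue path directly from module code.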


Comment on lines +42 to +44
if (control_queue_push(notify_subscribers, (void *)obj, ev_type) < 0) {
// XXX: add error stat if push fails?
}

⚠️ Potential issue | 🟡 Minor

Silent drop on queue push failure loses events.

When control_queue_push returns an error (e.g., ring full), the notification is silently discarded. For critical events this could cause inconsistent state between dataplane and control plane. Consider at minimum incrementing an error counter for observability.
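A minimal sketch of the observable-drop pattern suggested here, using a stand-in for the push function (the counter name and helper are hypothetical):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical drop counter so queue-full failures show up in stats
 * instead of vanishing silently. */
static atomic_ulong ctrlq_push_drops;

/* Stand-in for control_queue_push: fails when the ring is full. */
static int queue_push_stub(int full) { return full ? -1 : 0; }

static void notify_or_count(int queue_full, unsigned ev_type) {
	if (queue_push_stub(queue_full) < 0) {
		/* record the failure instead of discarding the event */
		atomic_fetch_add(&ctrlq_push_drops, 1);
		fprintf(stderr, "control queue full, dropped event type %u\n",
			ev_type);
	}
}
```

The counter can then be exported alongside the existing statistics so operators notice sustained drops.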


if (clone == NULL)
continue;

edges[flood_count] = iface->type == GR_IFACE_TYPE_VLAN ? ETH_IN : OUTPUT;

⚠️ Potential issue | 🟠 Major



Destination routing should check member type, not source interface type.

Packets are being cloned for each bridge member destination (line 88), but the edge selection at line 92 checks the source interface type (iface->type) instead of the destination (member->type). This is inconsistent with the special case at line 102, where the bridge interface destination always routes through ETH_IN regardless of source type, implying that destination routing should depend on destination interface type.

Proposed fix
-			edges[flood_count] = iface->type == GR_IFACE_TYPE_VLAN ? ETH_IN : OUTPUT;
+			edges[flood_count] = member->type == GR_IFACE_TYPE_VLAN ? ETH_IN : OUTPUT;

Comment on lines +114 to +127
if (gr_mbuf_is_traced(m)) {
for (uint16_t i = 0; i < flood_count; i++) {
struct bridge_flood_trace *t = gr_mbuf_trace_add(
clones[i], node, sizeof(*t)
);
t->bridge_id = br->id;
t->flood_count = i;
}
}

⚠️ Potential issue | 🔴 Critical

Null pointer dereference when br is NULL and tracing is enabled.

When iface_from_id() returns NULL at line 72, the code jumps to next: and sets up the drop path. However, if tracing is enabled, line 119 dereferences br->id which will crash.

Proposed fix
 		if (gr_mbuf_is_traced(m)) {
 			for (uint16_t i = 0; i < flood_count; i++) {
 				struct bridge_flood_trace *t = gr_mbuf_trace_add(
 					clones[i], node, sizeof(*t)
 				);
-				t->bridge_id = br->id;
+				t->bridge_id = br ? br->id : 0;
 				t->flood_count = i;
 			}
 		}

Comment on lines 65 to 68
if (m->ol_flags & RTE_MBUF_F_RX_VLAN_STRIPPED) {
vlan_id = m->vlan_tci & 0xfff;
m->ol_flags &= ~RTE_MBUF_F_RX_VLAN_STRIPPED;
} else if (eth->ether_type == RTE_BE16(RTE_ETHER_TYPE_VLAN)) {

⚠️ Potential issue | 🔴 Critical



VLAN tags are lost for bridged traffic when RX stripping is enabled.

When RTE_MBUF_F_RX_VLAN_STRIPPED is set, the VLAN ID is extracted to a local variable for FDB decisions but never stored in mbuf metadata. After forwarding, the TX path has no knowledge of the original VLAN ID and transmits the frame untagged, breaking VLAN isolation. The vlan_id must be persisted (e.g., in mbuf_data) and re-tagged on egress.

Note: Setting RTE_MBUF_F_TX_VLAN alone is insufficient—the TX path does not currently honor this flag or re-insert VLAN headers based on it.
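The re-tagging the egress path would need can be sketched in plain C as a simplified software insertion on a raw buffer (not the actual mbuf-based code; buffer layout and names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 12 /* destination + source MAC */
#define VLAN_TPID 0x8100

/* Re-insert an 802.1Q tag into a frame whose tag was stripped on RX.
 * Copies dst/src MACs, writes TPID + TCI, then the rest of the original
 * frame. Returns the new frame length. Buffers must not alias. */
static size_t vlan_reinsert(const uint8_t *in, size_t in_len,
			    uint16_t vlan_tci, uint8_t *out) {
	memcpy(out, in, ETH_ADDR_LEN);          /* MAC addresses */
	out[12] = VLAN_TPID >> 8;               /* TPID 0x8100 */
	out[13] = VLAN_TPID & 0xff;
	out[14] = vlan_tci >> 8;                /* PCP/DEI/VID */
	out[15] = vlan_tci & 0xff;
	memcpy(out + 16, in + ETH_ADDR_LEN, in_len - ETH_ADDR_LEN);
	return in_len + 4;
}
```

In the real datapath, the stripped vlan_tci would have to travel with the packet (e.g., in mbuf private data) so the TX path can perform this insertion, or hand it to hardware when the port supports TX VLAN insert.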



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@modules/l2/control/nexthop.c`:
- Around line 252-261: In the age-out branch inside nexthop L2 cleanup, you must
remove the nexthop from the hash table before calling nexthop_decref(nh) to
avoid leaving a dangling pointer; change the block that currently calls only
nexthop_decref(nh) to first delete the entry from the nexthop hash (use the same
hash-remove helper as used by nexthop_l2_purge_iface and
nexthop_l2_purge_bridge) and then call nexthop_decref(nh). Ensure the
hash-remove call targets the same key/nh used by the LOG(...) and precedes
nexthop_decref(nh).

@rjarry rjarry force-pushed the bridge branch 2 times, most recently from abf123b to 2155207 on January 29, 2026 at 16:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@modules/l2/cli/nexthop.c`:
- Line 83: The VLAN max value is off-by-one: update the ec_node_uint call inside
the with_help invocation that defines "VLAN" to use 4095 as the upper bound
instead of 4096 so the valid range becomes 0–4095; locate the with_help("VLAN
ID.", ec_node_uint("VLAN", 0, 4096, 10)) expression and change the upper limit
parameter passed to ec_node_uint to 4095.

"Create a new L2 entry.",
with_help("Output interface.", ec_node_dyn("IFACE", complete_iface_names, NULL)),
with_help("Nexthop ID.", ec_node_uint("ID", 1, UINT32_MAX - 1, 10)),
with_help("VLAN ID.", ec_node_uint("VLAN", 0, 4096, 10)),

⚠️ Potential issue | 🟡 Minor

VLAN ID range allows invalid value 4096.

VLAN IDs are 12-bit values with a valid range of 0-4095. The max value 4096 allows users to enter an invalid VLAN ID.

🐛 Proposed fix
-		with_help("VLAN ID.", ec_node_uint("VLAN", 0, 4096, 10)),
+		with_help("VLAN ID.", ec_node_uint("VLAN", 0, 4095, 10)),

@rjarry rjarry marked this pull request as draft February 1, 2026 20:47
rjarry added 17 commits February 3, 2026 14:51
Replace hardcoded array sizes and *_COUNT enum values with
UINT_NUM_VALUES() macro for all edge registration arrays. This makes
the array sizing explicit based on the index type and eliminates
tautological comparison warnings.

Update registration functions to use name-based validation instead of
bounds checking where applicable. Use gr_iface_type_name(),
gr_iface_mode_name(), and gr_nh_type_name() for better log messages
and consistent error reporting.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
The interface type name is already available via gr_iface_type_name()
which derives it from the type ID. Storing it again in struct iface_type
is redundant. This field is unused anyway. Remove the name field from
iface_type.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace the linked list of interface types with a static array indexed
by type ID. This makes iface_type_get() an O(1) array lookup instead of
O(n) linked list traversal.

This benefits iface_get_eth_addr() which is called in the eth_input hot
path and needs to look up the interface type to find the get_eth_addr
callback.

Since interface types are no longer linked, the STAILQ_ENTRY field is
removed and the iface_type structs can be declared const.

Substituting pointer chasing with direct array indexing reduces LLC
cache misses by 9% and increases throughput from 11.6M to 11.8M pkt/s
(+1.7%).

Signed-off-by: Robin Jarry <rjarry@redhat.com>
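The array-indexed lookup this commit describes can be sketched as follows (array size and struct fields are illustrative, not the actual grout definitions):

```c
#include <stddef.h>

#define IFACE_TYPE_COUNT 8

struct iface_type {
	unsigned id;
	const char *(*get_name)(void);
};

/* Static array indexed by type ID: O(1) lookup replacing the previous
 * O(n) linked-list traversal. Entries can stay const since nothing
 * needs to link them together anymore. */
static const struct iface_type *iface_types[IFACE_TYPE_COUNT];

static void iface_type_register(const struct iface_type *t) {
	iface_types[t->id] = t;
}

static const struct iface_type *iface_type_get(unsigned id) {
	return id < IFACE_TYPE_COUNT ? iface_types[id] : NULL;
}

/* Example registration (hypothetical type ID). */
static const struct iface_type bridge_type = {.id = 3, .get_name = NULL};
```

The bounds check keeps lookups for unknown IDs safe while still being a single branch plus an indexed load on the hot path.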
Instead of updating the shared iface_stats structure for every packet,
accumulate packets and bytes in local variables and flush to the stats
structure only when the interface changes or at the end of the batch.

This reduces memory writes from O(n) to O(number of unique interfaces)
per batch, which is typically O(1) for homogeneous traffic. The shared
stats structure is accessed less frequently, reducing cache line
pressure and store buffer usage.

Reduces LLC cache misses by 11% (3.20M to 2.86M) and eth_input cycles
from 48.0 to 46.2 per packet (-4%). Throughput remains at 11.8M pkt/s
as ip_input (88 cycles/pkt) dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
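The accumulate-then-flush pattern from this commit can be sketched as (types and names are illustrative, not the actual grout structures):

```c
#include <stdint.h>

#define MAX_IFACES 8

struct iface_stats {
	uint64_t packets, bytes;
};
static struct iface_stats stats[MAX_IFACES]; /* shared structure */

struct pkt {
	uint16_t iface_id;
	uint16_t len;
};

/* Accumulate in locals and flush to the shared stats only when the
 * interface changes or at the end of the batch: O(unique interfaces)
 * shared-memory writes per batch instead of O(packets). */
static void rx_account(const struct pkt *pkts, int n) {
	uint16_t cur = UINT16_MAX;
	uint64_t acc_pkts = 0, acc_bytes = 0;
	for (int i = 0; i < n; i++) {
		if (pkts[i].iface_id != cur) {
			if (cur != UINT16_MAX) {
				stats[cur].packets += acc_pkts;
				stats[cur].bytes += acc_bytes;
			}
			cur = pkts[i].iface_id;
			acc_pkts = 0;
			acc_bytes = 0;
		}
		acc_pkts++;
		acc_bytes += pkts[i].len;
	}
	if (cur != UINT16_MAX) { /* flush the final run */
		stats[cur].packets += acc_pkts;
		stats[cur].bytes += acc_bytes;
	}
}
```

For homogeneous traffic the inner flush never fires, so the shared cache line is touched once per batch.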
RCU quiescent state only needs to be reported periodically to signal
that the thread has passed through a quiescent period. Reporting it on
every graph walk iteration is unnecessarily frequent and adds overhead
from the memory barriers involved.

Move rte_rcu_qsbr_quiescent() from the main loop body into the
housekeeping block that runs every 256 iterations. This reduces the
frequency of memory barriers while still ensuring timely RCU grace
period completion.

Reduces LLC cache misses by 7% (2.86M to 2.67M) and L1 cache misses by
6% (2.14B to 2.01B). Throughput remains at 11.8M pkt/s as ip_input
dominates the critical path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Replace individual rte_node_enqueue_x1() calls with a batching scheme.
Each rte_node_enqueue_x1() call accesses the destination node's rte_node
structure to append one packet. Even when all packets go to the same
edge, touching these fields once per packet instead of once per batch
causes unnecessary cache pressure.

Introduce two helper macros:

 - NODE_ENQUEUE_NEXT: tracks runs of consecutive packets going to the
   same edge. When the edge changes, flush the previous run using
   rte_node_enqueue() which copies mbuf pointers in bulk.

 - NODE_ENQUEUE_FLUSH: at end of loop, if all packets went to the same
   edge (common case), use rte_node_next_stream_move() which swaps
   pointers and updates idx once. Otherwise flush the final run.

These macros are not used in port_rx (which uses stream move directly), control_input (a source node), or control_output and port_tx (where packets are consumed directly).

Increases throughput by 10.2% (11.8M to 13.0M pkt/s). Cycles per packet
reduced significantly in output nodes: port_output -37%, iface_output
-30%, ip_forward -22%, eth_output -18%, ip_output -14%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
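The run-tracking idea behind these macros can be sketched as a simplified model that counts bulk flushes instead of calling the real rte_node API (all names here are stand-ins):

```c
/* Stand-in for rte_node_enqueue(): one call flushes a whole run. */
static int bulk_calls;

static void flush_run(int edge, int count) {
	(void)edge;
	(void)count;
	bulk_calls++;
}

/* Track runs of consecutive packets going to the same edge and flush
 * each run once, mirroring the NODE_ENQUEUE_NEXT/NODE_ENQUEUE_FLUSH
 * idea: for homogeneous traffic this collapses to a single call. */
static int enqueue_batched(const int *edges, int n) {
	if (n == 0)
		return 0;
	int run_edge = edges[0], run_len = 1;
	for (int i = 1; i < n; i++) {
		if (edges[i] == run_edge) {
			run_len++;
		} else {
			flush_run(run_edge, run_len);
			run_edge = edges[i];
			run_len = 1;
		}
	}
	flush_run(run_edge, run_len); /* flush the final run */
	return bulk_calls;
}
```

With per-packet enqueue, a 7-packet burst costs 7 calls; with run tracking it costs one call per edge change, which is what drives the cycle savings quoted above.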
The rx_process function was checking the ether_type of every received
packet to detect LACP slow protocol frames (0x8809). This check is only
relevant for bonded interfaces, yet it was performed unconditionally on
all ports.

Move the ether_type check outside the main loop and only execute it when
the interface mode is GR_IFACE_MODE_BOND. For non-bonded ports, the loop
now simply stores the iface pointer without accessing packet data.

This reduces L1 cache misses by 6% (2.26B to 2.13B) and port_rx node
cycles per packet by 10% (47.0 to 42.1). The perf annotate data shows
that the ether_type comparison accounted for 59% of L1 cache misses
within rx_process. Eliminating this access for non-bonded ports cuts
rx_process cache miss samples by 61%.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Requesting RTE_ETH_TX_OFFLOAD_VLAN_INSERT at port configuration time
causes DPDK to select a more expensive TX function that checks every
packet's ol_flags for offload requests. For the ice driver, this means
using ice_xmit_pkts_vec_avx2_offload instead of ice_xmit_pkts_vec_avx2,
adding roughly 4% overhead even when no packets require VLAN insertion.

Introduce add_subinterface and del_subinterface callbacks in struct
iface_type to let interface types react when subinterfaces are created
or destroyed. The callbacks are invoked from iface_add_subinterface()
and iface_del_subinterface() which now return an error code.

Implement these callbacks for ports: when the first VLAN subinterface is
added, enable RTE_ETH_TX_OFFLOAD_VLAN_INSERT and reconfigure the port.
When the last VLAN subinterface is removed, disable it. For bonds,
propagate the setting to all member ports. Also handle the case where
a port joins or leaves a bond that already has VLAN subinterfaces.

Store the negotiated RX and TX offload flags in iface_info_port so that
datapath nodes can check hardware capabilities without calling
rte_eth_dev_info_get() at runtime.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
The eth_output node was resolving VLAN subinterfaces and inserting VLAN
headers inline. This coupled L2 framing with interface type logic and
prevented using hardware TX VLAN offload since eth_output has no access
to port capabilities.

Move VLAN subinterface resolution to iface_output. When the destination
interface is a VLAN, resolve its parent and store the VLAN ID in mbuf
private data via the new iface_inout_mbuf_data structure. Update the
output interface pointer to the parent so that downstream nodes operate
on the physical port or bond.

Move VLAN header insertion to port_output which can check port->tx_offloads.
When RTE_ETH_TX_OFFLOAD_VLAN_INSERT is available, set RTE_MBUF_F_TX_VLAN
and vlan_tci on the mbuf and let the hardware insert the tag. Otherwise,
prepend the VLAN header in software by shifting the Ethernet addresses
and inserting the 802.1Q tag.

Simplify eth_input and eth_output traces to only record the Ethernet
header. The interface and VLAN information that was previously in these
traces moves to iface_output. The eth_input trace temporarily loses
this context; it will be restored once iface_input is introduced on the
RX path.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce iface_input between port_rx and eth_input to provide a L1
interface abstraction on the RX path, mirroring iface_output on TX.

This refactoring prepares for bridge support. When VLAN subinterfaces
are bridge members, the VLAN tag must be stripped before L2 processing
so the bridge can learn and forward based on the inner frame. With
iface_input in place, the processing order becomes: port_rx, iface_input,
bridge_input. Packets only enter eth_input when destined for the local
stack; pure L2 switching bypasses Ethernet header parsing entirely.

Move VLAN tag handling to port_rx which has access to port->rx_offloads
and can check whether the hardware already stripped the tag. When the
VLAN is still inline, strip it by shifting the Ethernet addresses
forward and adjusting the packet offset. Store the resolved interface
and VLAN ID in iface_inout_mbuf_data so downstream nodes always see
a clean frame. Handle bond member ports specially: LACP slow protocol
frames must be processed by the member port, not the bond.
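The software strip is the mirror image of insertion. A minimal sketch, with a hypothetical function name and the caller expected to advance its data offset by 4 bytes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VLAN_HLEN 4
#define ETH_ADDRS_LEN 12 /* dst mac (6) + src mac (6) */

/* Strip an inline 802.1Q tag: read the TCI, then shift the Ethernet
 * addresses forward over the tag. The resulting frame starts 4 bytes
 * later; returns the extracted TCI. */
static uint16_t vlan_strip_sw(uint8_t *frame, uint8_t **new_frame) {
    uint16_t tci = (uint16_t)(frame[ETH_ADDRS_LEN + 2] << 8)
                 | frame[ETH_ADDRS_LEN + 3];
    memmove(frame + VLAN_HLEN, frame, ETH_ADDRS_LEN);
    *new_frame = frame + VLAN_HLEN;
    return tci;
}
```

When RTE_ETH_RX_OFFLOAD_VLAN_STRIP already removed the tag, the node only needs to read mbuf->vlan_tci instead.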

The new iface_input node handles VLAN subinterface resolution for VRF
mode interfaces, interface up/down checking, and RX statistics
accounting. It routes packets to the next node based on interface mode.

Simplify eth_input to only parse the Ethernet header and classify
packets by destination address domain (local, broadcast, multicast,
other). Move iface_input_mode_register() to iface_input.c and register
bond mode from lacp_input since bond members need eth_input for LACP.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
A future change will require calling control_queue_push() from
gr_event_push() which lives in main/. If control_queue stays in the
infra module, this would create a circular dependency between main and
infra.

Move control_queue.c and gr_control_queue.h to main/ and replace the
event-based drain mechanism with explicit control_queue_drain() calls
from iface_destroy() and nexthop_destroy() after the RCU sync.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Pass rte_lcore_id() to rte_rcu_qsbr_synchronize() instead of
RTE_QSBR_THRID_INVALID to exclude the calling thread from the quiescent
state wait. This is needed to allow creating objects from datapath
workers.

Also move the quiescent state report inside the housekeeping block
to reduce overhead.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Previously, calling nexthop_new() from a datapath thread would trigger
an ABORT(). This was a defensive measure to prevent calling functions
that are not thread-safe from the wrong context.

In practice, the only constraint is that event notifications must be
processed on the control plane thread. Modify gr_event_push() to detect
when it is called from a datapath worker and use the control queue to
defer the notification to the control plane event loop.

This enables datapath nodes (such as bridge MAC learning) to create
nexthops on the fly without blocking the control plane.
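The dispatch logic can be modeled in a few lines. This is a single-threaded toy, not grout's implementation: is_datapath_worker, event_push and control_queue_drain are stand-ins for the real lcore role check, gr_event_push() and the control queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct event { int type; };

#define QUEUE_CAP 16
static struct event queue[QUEUE_CAP];
static size_t queue_len;
static int delivered;           /* events processed on the "control plane" */
static bool is_datapath_worker; /* stand-in for an lcore role check */

static void notify(const struct event *ev) { delivered += 1; (void)ev; }

static void event_push(struct event ev) {
    if (is_datapath_worker) {
        /* wrong thread: defer to the control plane event loop */
        assert(queue_len < QUEUE_CAP);
        queue[queue_len++] = ev;
    } else {
        notify(&ev); /* already on the control plane, notify inline */
    }
}

/* called from the control plane event loop */
static void control_queue_drain(void) {
    for (size_t i = 0; i < queue_len; i++)
        notify(&queue[i]);
    queue_len = 0;
}
```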

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Introduce a new L2 nexthop type (GR_NH_T_L2) and associated origin
(GR_NH_ORIGIN_BRIDGE) to support MAC address learning on bridge
interfaces.

Create a new l2 module with API, control plane and CLI components. The
API defines gr_nexthop_info_l2 containing bridge_id, vlan_id and mac
address fields. The control plane implements a RCU-protected hash table
for fast lookups by (bridge_id, vlan_id, mac) tuple.

The nexthop_learn_l2() function can be called from datapath workers to
create new entries or refresh existing ones. Learned entries have a 300
second ageing timeout after which they are automatically removed. Helper
functions allow purging all entries associated with a given interface or
bridge.
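The learn/refresh/expire life cycle can be sketched like this. grout uses an RCU-protected hash table; a linear array is enough to show the semantics. All names are illustrative, only the key tuple and the 300 s default come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define AGEING_TIMEOUT 300 /* seconds, default mentioned above */
#define MAX_ENTRIES 64

struct l2_key {
    uint16_t bridge_id;
    uint16_t vlan_id;
    uint8_t mac[6];
};

struct l2_entry {
    struct l2_key key;
    uint16_t out_iface;
    uint64_t last_seen; /* refreshed on every learn */
    bool used;
};

static struct l2_entry table[MAX_ENTRIES];

static struct l2_entry *l2_lookup(const struct l2_key *key) {
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table[i].used && memcmp(&table[i].key, key, sizeof(*key)) == 0)
            return &table[i];
    return NULL;
}

/* create a new entry or refresh the timestamp of an existing one */
static void l2_learn(const struct l2_key *key, uint16_t iface, uint64_t now) {
    struct l2_entry *e = l2_lookup(key);
    if (e == NULL) {
        for (int i = 0; i < MAX_ENTRIES; i++)
            if (!table[i].used) { e = &table[i]; break; }
        if (e == NULL)
            return; /* table full, drop the learn */
        e->used = true;
        e->key = *key;
    }
    e->out_iface = iface;
    e->last_seen = now;
}

/* remove entries not refreshed within the ageing timeout */
static void l2_age(uint64_t now) {
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table[i].used && now - table[i].last_seen > AGEING_TIMEOUT)
            table[i].used = false;
}
```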

The CLI provides grcli commands to manually add L2 nexthops and display
their info.

No support in the frr plugin for now. It will require some more thinking
to translate grout nexthops into zebra neighbors back and forth.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Implement a new bridge interface type (GR_IFACE_TYPE_BRIDGE) along with
its associated mode (GR_IFACE_MODE_BRIDGE) for member interfaces.

The bridge configuration includes a configurable ageing time for learned
MAC addresses (default 300s), flags to disable flooding and MAC
learning, an optional MAC address (random if not specified), and up to
64 member interfaces (ports or bonds).

When a member interface is attached to a bridge, its mode is set to
GR_IFACE_MODE_BRIDGE and its domain_id points to the bridge. When
detached, it reverts to GR_IFACE_MODE_VRF. On bridge deletion, all
members are detached and all associated L2 nexthops are purged.

Bridge interfaces are always in VRF mode and they reference themselves
via their domain_id. The L2 nexthop ageing callback now uses the
per-bridge ageing_time setting instead of a hardcoded value.
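The attach/detach state transitions boil down to a mode and domain swap. A rough sketch; the enum values and field names approximate the grout API rather than copy it:

```c
#include <assert.h>
#include <stdint.h>

enum iface_mode { IFACE_MODE_VRF, IFACE_MODE_BRIDGE };

struct iface {
    uint16_t id;
    enum iface_mode mode;
    uint16_t domain_id; /* VRF id, or bridge iface id when bridged */
    uint16_t vrf_id;
};

static void bridge_attach(struct iface *member, const struct iface *bridge) {
    member->mode = IFACE_MODE_BRIDGE;
    member->domain_id = bridge->id; /* domain now points at the bridge */
}

static void bridge_detach(struct iface *member) {
    member->mode = IFACE_MODE_VRF;      /* revert to routed mode */
    member->domain_id = member->vrf_id; /* back to its own VRF domain */
}
```

The bridge interface itself stays in VRF mode with domain_id equal to its own id, which is the self-reference mentioned above.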

Update the frr plugin to expose bridge interfaces to zebra as
ZEBRA_IF_BRIDGE and set the bridge_ifindex on member interfaces with
ZEBRA_IF_SLAVE_BRIDGE slave type.

For now, there is no datapath support for bridge interfaces. This will
be added in the next commit.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Add the packet processing nodes for bridge forwarding. Register
GR_IFACE_MODE_BRIDGE and GR_IFACE_TYPE_BRIDGE in the port_rx and
iface_output nodes, respectively, to steer packets to bridge_input.

The bridge_input node handles incoming packets on bridge member
interfaces. It extracts the VLAN ID (if present), learns the source MAC
address (unless learning is disabled), and performs destination lookup.
Known unicast destinations are forwarded directly to the learned output
interface. Unknown unicast, broadcast and multicast packets are sent to
the flood node. Packets destined back to their source interface are
dropped as hairpin.
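The forwarding decision distills to a short classifier. The function and flag names below are placeholders for the real bridge_input logic, with the FDB lookup abstracted behind a callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum verdict { FORWARD, FLOOD, DROP_HAIRPIN };

/* group bit of the destination address covers broadcast and multicast */
static bool eth_addr_is_group(const uint8_t mac[6]) {
    return (mac[0] & 1) != 0;
}

/* fdb_lookup returns the learned output iface id, or 0 when unknown */
static enum verdict bridge_input(
    const uint8_t dst_mac[6], uint16_t in_iface,
    uint16_t (*fdb_lookup)(const uint8_t mac[6]),
    uint16_t *out_iface
) {
    if (eth_addr_is_group(dst_mac))
        return FLOOD; /* broadcast/multicast is always flooded */
    uint16_t dst = fdb_lookup(dst_mac);
    if (dst == 0)
        return FLOOD; /* unknown unicast */
    if (dst == in_iface)
        return DROP_HAIRPIN; /* would go back out the source port */
    *out_iface = dst;
    return FORWARD;
}
```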

The bridge_flood node replicates packets to all bridge members except
the source interface. It also floods to the bridge interface itself.
Packets are cloned for each output member.

Both nodes handle the NO_FLOOD and NO_LEARN bridge flags to allow
disabling these behaviors per bridge.
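The replication loop can be sketched as below: one copy per member except the source, plus one for the bridge interface itself, unless flooding is disabled. Names and the flag value are illustrative; emit() stands in for cloning the mbuf and enqueueing it toward iface_output.

```c
#include <assert.h>
#include <stdint.h>

#define BR_NO_FLOOD 0x1

struct bridge {
    uint16_t id; /* also the bridge interface id */
    uint32_t flags;
    uint16_t members[64];
    int n_members;
};

/* returns the number of copies emitted */
static int bridge_flood(
    const struct bridge *br, uint16_t src_iface,
    void (*emit)(uint16_t iface)
) {
    int copies = 0;
    if (br->flags & BR_NO_FLOOD)
        return 0; /* flooding disabled on this bridge */
    for (int i = 0; i < br->n_members; i++) {
        if (br->members[i] == src_iface)
            continue; /* never send back out the source port */
        emit(br->members[i]);
        copies++;
    }
    emit(br->id); /* deliver a copy to the bridge interface itself */
    return copies + 1;
}
```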

When packets exit a bridge via the bridge interface itself, they are
sent to eth_input.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
Verify basic bridge functionality with three ports attached to a bridge
interface. The test exercises L2 forwarding between namespaces, MAC
learning, static L2 nexthop creation overriding a learned entry, and L3
connectivity to the bridge interface itself.

Also verify that detaching a member port and deleting the bridge
properly cleans up the associated L2 nexthops.

Signed-off-by: Robin Jarry <rjarry@redhat.com>