fix(net): DpdkBackend::send_frame dropped every frame → zero TCP connections #108
…e set_data_len)
Root cause of the DUT establishing ZERO TCP connections on real hardware
(perf run 28552788420: TRex opened 150k flows / sent 21M packets, Active-flows=0).
DpdkBackend::send_frame called mbuf.data_mut() on a freshly-allocated mbuf —
whose data_len is 0 — so the returned slice was zero-length, the
'data.len() < frame.len()' capacity check was ALWAYS true, and it returned
'Frame too large: mbuf capacity: 0' before ever reaching tx_burst. Every
outbound frame (SYN-ACK, ACK, data, RST, retransmit) for both TCP and QUIC
real-NIC was silently dropped. Reorder to compute capacity from
buf_len-data_offset and set_data_len() BEFORE data_mut(), matching the proven
UDP TX path (dpdk-udp/src/lib.rs:1498-1523).
- Tighten test_dpdk_backend_send_frame: it tolerated the capacity error
('is_err() || ...'), masking the bug. Now asserts the frame reaches tx_burst
(Ok, or WouldBlock under stubs) and never fails the capacity check.
- runtime.rs: the engine driver discarded send_frame errors ('let _ ='), so the
silent drop was invisible and cost a full EC2 run to find. Route all 3 TX
sites through send_or_warn(), which logs the first non-transient TX error.
Verified locally (cargo build + test, no EC2). Found via a code-level
investigation of the RX->SYN->SYN-ACK path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
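
The ordering problem described in the commit message above can be reproduced without DPDK. The following is a minimal sketch, not the project's code: `MockMbuf`, `TxError`, `write_frame_buggy` and `write_frame_fixed` are hypothetical names, and the mock only assumes what the commit message states, namely that `data_mut()` exposes exactly `data_len` bytes and that a fresh mbuf has `data_len == 0`.

```rust
/// Hypothetical stand-in for a DPDK mbuf (not the real binding).
/// `data_mut()` exposes only the first `data_len` bytes of the data area.
struct MockMbuf {
    buf: Vec<u8>,
    data_offset: usize,
    data_len: usize,
}

impl MockMbuf {
    /// A freshly allocated mbuf: buffer reserved, but data_len is 0.
    fn new(buf_len: usize, data_offset: usize) -> Self {
        Self { buf: vec![0; buf_len], data_offset, data_len: 0 }
    }

    fn buf_len(&self) -> usize {
        self.buf.len()
    }

    fn set_data_len(&mut self, len: usize) {
        self.data_len = len;
    }

    fn data_mut(&mut self) -> &mut [u8] {
        let start = self.data_offset;
        &mut self.buf[start..start + self.data_len]
    }
}

#[derive(Debug, PartialEq)]
enum TxError {
    FrameTooLarge { capacity: usize },
}

/// Buggy order: measures capacity through data_mut() on a fresh mbuf,
/// so the slice is empty and any non-empty frame is rejected.
fn write_frame_buggy(mbuf: &mut MockMbuf, frame: &[u8]) -> Result<(), TxError> {
    let data = mbuf.data_mut();
    if data.len() < frame.len() {
        return Err(TxError::FrameTooLarge { capacity: data.len() });
    }
    data[..frame.len()].copy_from_slice(frame);
    Ok(())
}

/// Fixed order: capacity comes from buf_len - data_offset, and
/// set_data_len() runs before data_mut() so the slice has the right size.
fn write_frame_fixed(mbuf: &mut MockMbuf, frame: &[u8]) -> Result<(), TxError> {
    let capacity = mbuf.buf_len().saturating_sub(mbuf.data_offset);
    if capacity < frame.len() {
        return Err(TxError::FrameTooLarge { capacity });
    }
    mbuf.set_data_len(frame.len());
    mbuf.data_mut().copy_from_slice(frame);
    Ok(())
}

fn main() {
    let frame = [0xAAu8; 64];
    let mut a = MockMbuf::new(2048, 128);
    println!("buggy: {:?}", write_frame_buggy(&mut a, &frame));
    let mut b = MockMbuf::new(2048, 128);
    println!("fixed: {:?}", write_frame_fixed(&mut b, &frame));
}
```

A test in the spirit of the tightened `test_dpdk_backend_send_frame` would assert that the fixed path never returns the capacity error for a frame that fits, rather than accepting any `is_err()`.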
Synthetic Performance Results (run)
Commit:
✅ synthetic UDP socket bound to 10.0.0.1:9000 (MAC: 02:00:00:00:00:01)
Synthetic UDP Performance Results
Measures framework overhead: sync
IPv4 Baseline
IPv6
IPv6 vs IPv4 Comparison (sync path)
IPv4 avg sync/async ratio: 0.9x, worst: 1.0x | IPv6 vs IPv4 worst ratio: 1.29x (OK)
Synthetic Performance Results - Graviton (run)
Commit:
✅ synthetic UDP socket bound to 10.0.0.1:9000 (MAC: 02:00:00:00:00:01)
Synthetic UDP Performance Results
Measures framework overhead: sync
IPv4 Baseline
IPv6
IPv6 vs IPv4 Comparison (sync path)
IPv4 avg sync/async ratio: 0.9x, worst: 1.0x | IPv6 vs IPv4 worst ratio: 1.32x (OK)
[CI] Stage: Deploy
Infrastructure ready.
[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)
✅ Integration Tests Passed (Run 28555793043)
Branch:
Test Results
Application Logs (last 20 lines): receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log
Full Application Logs (last 200 lines each): receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log
✅ Integration Tests Passed - Graviton (run)
Branch:
Test Results
Application Logs (last 20 lines): receiver-echo-server.log, sender-echo-server.log, sender-test-client.log
…#109)

Companion to #108. Perf run [28552788420](https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/28552788420) actually produced real per-config result JSON, but retrieving it with `cat` over SSM truncated it at **~24,000 chars** (AWS SSM `StandardOutputContent` cap). Every downloaded file was exactly **23984 bytes** and ended mid-structure with `--output truncated--`, so JSON parsing failed and the numbers were lost.

**Fix:** retrieve with `gzip -c file | base64 -w0` between markers, decode locally. gzip shrinks the JSON ~15× (24 KB → **1.6 KB** on the wire), far under the cap. Verified locally end-to-end (marker-wrap → extract → `base64 -d` → `gunzip` recovers the bytes byte-exact); harness unit tests green.

**Both #108 and this are needed for a green TCP perf run:** #108 makes the DUT actually establish connections; this delivers the resulting metrics intact. Merge both, then re-dispatch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
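
The local half of the marker-wrap → extract → `base64 -d` pipeline described above could look roughly like the sketch below. It is an illustration under assumptions, not the harness code: the function names and the `@@BEGIN@@` / `@@END@@` marker strings are hypothetical, the base64 decoder is a small hand-written one for the standard alphabet, and the final `gunzip` step is left out because the Rust standard library has no gzip support (a crate would be needed for that).

```rust
/// Returns the text between the begin and end markers, or None if either
/// marker is missing (for example because the output was truncated).
fn extract_between<'a>(output: &'a str, begin: &str, end: &str) -> Option<&'a str> {
    let start = output.find(begin)? + begin.len();
    let stop = start + output[start..].find(end)?;
    Some(output[start..stop].trim())
}

/// Decodes standard-alphabet base64 (RFC 4648); stops at the first '='.
/// Returns None on any character outside the alphabet.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits currently in acc
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

fn main() {
    // Simulated command output: shell noise around a marker-wrapped payload.
    let ssm_output = "login banner\n@@BEGIN@@\naGVsbG8=\n@@END@@\n";
    let payload = extract_between(ssm_output, "@@BEGIN@@", "@@END@@")
        .expect("markers missing: output was probably truncated");
    let bytes = base64_decode(payload).expect("payload is not valid base64");
    println!("{}", String::from_utf8_lossy(&bytes));
}
```

One benefit of the marker approach is that truncation becomes detectable: if the end marker never arrives, extraction fails outright instead of handing a partial payload to the JSON parser.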
The bug behind the zero-connection perf run

Perf run 28552788420 got all the way to running traffic (TRex opened 150,002 flows and sent 21.3M packets) but reported Active-flows: 0. The DUT `tcp-echo` never completed a single handshake.

Root cause (found by code investigation, not another EC2 run): `DpdkBackend::send_frame` called `mbuf.data_mut()` on a freshly-allocated mbuf, whose `data_len` is 0, so the returned slice was zero-length. The `data.len() < frame.len()` capacity check was therefore true for every non-empty frame, and the function returned `Frame too large: mbuf capacity: 0` before ever calling `tx_burst`. Every outbound frame (SYN-ACK, ACK, data, RST, retransmit) for both TCP and QUIC real-NIC was silently dropped. The engine driver discarded the error (`let _ =`), so it was invisible.

Fix

- Reorder `send_frame` to compute capacity from `buf_len - data_offset` and call `set_data_len()` before `data_mut()`, matching the proven UDP TX path (`dpdk-udp/src/lib.rs:1498-1523`, which even has the comment "Set data_len first so data_mut() returns the right size slice").
- Tighten `test_dpdk_backend_send_frame`: it tolerated the capacity error (`is_err() || …`), which masked this. Now asserts the frame reaches `tx_burst` and never fails the capacity check. (Fails before the fix, passes after.)
- `runtime.rs`: route all 3 TX sites through `send_or_warn()`, which logs the first non-transient TX error instead of `let _ =`. A silent TX drop cost us a full EC2 run to diagnose; it won't be silent again.

Impact

Unblocks both TCP and QUIC real-NIC transmit. Verified locally (`cargo build` + 220+ tests pass, no EC2).

🤖 Generated with Claude Code
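
The `send_or_warn()` routing described under Fix might be shaped like the sketch below. This is a guess at the idea, not the code in `runtime.rs`: the `TxWarnState` type, its fields, the `site` label and the use of `std::io::Error` are all assumptions made for illustration, as is treating `WouldBlock` as the transient case.

```rust
use std::io::{self, ErrorKind};

/// Hypothetical helper state: remembers whether a TX error was already
/// reported, and counts how many frames failed for a non-transient reason.
#[derive(Default)]
struct TxWarnState {
    warned: bool,
    dropped: u64,
}

impl TxWarnState {
    /// Consumes the result of a send call instead of discarding it with
    /// `let _ =`. Returns true if this call emitted the warning.
    fn send_or_warn(&mut self, site: &str, result: io::Result<()>) -> bool {
        match result {
            Ok(()) => false,
            // Assumption: WouldBlock means the TX ring is momentarily full,
            // which is expected under load and therefore not reported.
            Err(e) if e.kind() == ErrorKind::WouldBlock => false,
            Err(e) => {
                self.dropped += 1;
                if self.warned {
                    return false;
                }
                self.warned = true;
                eprintln!("TX error at {site}: {e} (further TX errors suppressed)");
                true
            }
        }
    }
}

fn main() {
    let mut state = TxWarnState::default();
    let capacity_err =
        || io::Error::new(ErrorKind::Other, "Frame too large: mbuf capacity: 0");

    // Three call sites, mirroring the "3 TX sites" mentioned in the PR.
    state.send_or_warn("syn_ack", Err(ErrorKind::WouldBlock.into()));
    state.send_or_warn("ack", Err(capacity_err()));
    state.send_or_warn("data", Err(capacity_err()));
    println!("non-transient TX failures: {}", state.dropped);
}
```

Logging only the first error keeps a per-packet failure from flooding the log, while the counter still records how many frames were lost.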