CVPN-2361: Add batch receive on apple platforms #378
kp-thomas-yau wants to merge 13 commits into main
Conversation
Make this a feature flag so it is opt-in for now. If the feature stabilizes later, we can make it part of the core code after rounds of QA tests.
Add a new `enable_batch_receive()` and call it only when the `enable_batch_receive` config item is true, so the client side can enable or disable this feature.
When `batch_receive` is enabled, the outside IO task can no longer rely on polling the socket for readability since packets will be received by a separate recv task using recvmsg_x. Instead, use a Semaphore to signal packet availability — the recv task adds one permit per packet pushed into the recv_queue, and the outside IO task awaits a permit before processing.
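The permit-per-packet handshake described above can be sketched with a std-only counting semaphore (the PR itself uses a tokio `Semaphore`; `PacketSemaphore` and the queue shape here are illustrative, not the PR's actual API):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Minimal counting semaphore: the recv task adds one permit per packet
// pushed into the queue; the consumer blocks until a permit exists.
struct PacketSemaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

impl PacketSemaphore {
    fn new() -> Self {
        Self { permits: Mutex::new(0), cond: Condvar::new() }
    }

    // recv task side: called once per packet pushed into the queue
    fn add_permit(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cond.notify_one();
    }

    // outside IO task side: wait until at least one packet is available
    fn acquire(&self) {
        let mut n = self.permits.lock().unwrap();
        while *n == 0 {
            n = self.cond.wait(n).unwrap();
        }
        *n -= 1;
    }
}

fn main() {
    let sem = PacketSemaphore::new();
    let mut queue: VecDeque<Vec<u8>> = VecDeque::new();

    // recv task: push a packet, then signal availability
    queue.push_back(vec![0u8; 1500]);
    sem.add_permit();

    // outside IO task: holding a permit guarantees a queued packet
    sem.acquire();
    let pkt = queue.pop_front().expect("permit implies a queued packet");
    assert_eq!(pkt.len(), 1500);
}
```

The invariant is that permits never exceed queued packets, so the consumer can never pop an empty queue after acquiring a permit.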
Code coverage summary for 063a5eb: ✅ Region coverage 69% passes
Wrapping the socket in Arc prevents modification after we create the Socket object, so we now wrap the socket in Arc before returning `outside_io` in the main connection function.
Avoid cluttering the Udp struct with new fields and logic by moving them into a new file instead, for a cleaner layout.
…ng buffer `handle_udp_recv` runs when the socket is readable and attempts to receive multiple packets at once. If it succeeds, we push all the packets in one go via `write_chunk_uninit`, filling the ring buffer from an iterator to save time. Also add a `BatchReceiverConsumerError` error so that we can propagate the error properly when it exits early.
Fixed a bug where we read `MAX_OUTSIDE_MTU` bytes for every outside packet instead of the actual message data length.
Discovered a bug while stress-testing (with a speed test) on a 10Gbps line: the ring buffer quickly gets overwhelmed and becomes full. When it is full, the tokio select loop spins constantly (since packets are already readable in the kernel) without yielding control back to the scheduler. This starves `outside_io_task`, so none of the packets get processed. We now force a yield so that the tokio task scheduler can pick up other tasks.
Add unit tests for both the `BatchReceiver` struct and `handle_udp_recv`. Also update CI to run these unit tests.
Use self.sock.as_ref() so pmtud helpers receive &UdpSocket instead of &Arc<UdpSocket>, which does not implement AsRawSocket on Windows.
kp-mariappan-ramasamy left a comment

This is a well-done PR, thanks.
Changes in general LGTM.
I was also wondering whether it would be nicer to push this "recv_multiple" concept further up the chain to OutsideIO, like:
/// Receive multiple packets at once. Each received packet is placed into
/// the corresponding slot of `bufs`. Returns the number of packets
/// received. The default receives a single packet via [`recv_buf`](Self::recv_buf).
fn recv_multiple_buf(&self, bufs: &mut [bytes::BytesMut]) -> IOCallbackResult<usize>;

So the application can use it if this support is enabled.
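The reviewer's suggestion could look roughly like this default trait method (trait name, `FakeIo`, and the buffer/result types are simplified stand-ins for the crate's real `OutsideIO`, `bytes::BytesMut`, and `IOCallbackResult`):

```rust
use std::io;

trait OutsideIo {
    /// Receive a single packet into `buf`.
    fn recv_buf(&self, buf: &mut Vec<u8>) -> io::Result<usize>;

    /// Receive multiple packets at once. Each received packet is placed
    /// into the corresponding slot of `bufs`. Returns the number of
    /// packets received. The default receives a single packet via
    /// `recv_buf`, so existing implementations keep working unchanged.
    fn recv_multiple_buf(&self, bufs: &mut [Vec<u8>]) -> io::Result<usize> {
        match bufs.first_mut() {
            Some(buf) => self.recv_buf(buf).map(|_| 1),
            None => Ok(0),
        }
    }
}

// A toy implementation that always "receives" a 3-byte packet.
struct FakeIo;

impl OutsideIo for FakeIo {
    fn recv_buf(&self, buf: &mut Vec<u8>) -> io::Result<usize> {
        buf.extend_from_slice(&[1, 2, 3]);
        Ok(3)
    }
}

fn main() {
    let io_obj = FakeIo;
    let mut bufs = vec![Vec::new(), Vec::new()];
    // The default impl fills only the first slot, reporting one packet.
    assert_eq!(io_obj.recv_multiple_buf(&mut bufs).unwrap(), 1);
    assert_eq!(bufs[0], vec![1, 2, 3]);
    assert!(bufs[1].is_empty());
}
```

A batch-capable backend (e.g. the `recvmsg_x` path) would override the default to fill multiple slots per call, while every other platform inherits the single-packet fallback for free.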
we can get rid of this commit e9b6b74419606e5667945e2e4eef958b997db740
clippy: Only import 'IpPmtudisc' on Windows
/// `poll(tokio::io::Interest::READABLE)` on the socket itself. But, if this client
/// is built with `batch_receive` and `enable_batch_receive` is set to true in the config,
/// it will instead try to acquire a permit of a Semaphore.
We could move the comment about batch_receive to the actual function instead of the trait definition, since it is just batch_receive's implementation detail and can change later.
// Not supported yet
// fn sendmsg_x(
//     s: libc::c_int,
//     msgp: *const msghdr_x,
//     cnt: libc::c_uint,
//     flags: libc::c_int,
// ) -> isize;
We can delete this and add later when required.
Err(e) if e.kind() == io::ErrorKind::WouldBlock => break 0,
Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
What is the difference between `break 0` and `continue`?
It looks like both behave similarly, since we do not have anything after the loop.
Just curious whether you wanted to handle anything else with this differentiation.
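For the record, the two arms do differ even when nothing follows the loop: `continue` retries the loop body (the usual treatment of `Interrupted`/EINTR), while `break 0` terminates the loop and makes `0` its value, since `loop` is an expression in Rust. A minimal illustration (the event strings here are stand-ins for the real `io::ErrorKind` matches):

```rust
// `loop` is an expression: `break v` ends it with value v, while
// `continue` re-runs the body. In the PR both arms only look alike
// because the loop's value happens to be unused afterwards.
fn classify(event: &str) -> &'static str {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match event {
            // Transient interruption: retry the body once.
            "EINTR" if attempts < 2 => continue,
            "EINTR" => break "retried",
            // WouldBlock (or anything else): stop immediately.
            _ => break "stopped",
        }
    }
}

fn main() {
    assert_eq!(classify("EINTR"), "retried");
    assert_eq!(classify("EWOULDBLOCK"), "stopped");
}
```

So in the original snippet, EINTR re-attempts the receive, whereas WouldBlock reports that zero further packets are available right now.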
io_error: Arc<Mutex<Option<io::Error>>>,
) {
    let mut recv_bufs = [[0u8; MAX_OUTSIDE_MTU]; BATCH_SIZE];
    let mut msg_lens = [0usize; BATCH_SIZE];
let mut msg_lens = [0usize; BATCH_SIZE];
let mut recv_bufs: [BytesMut; BATCH_SIZE] =
    std::array::from_fn(|_| BytesMut::with_capacity(MAX_OUTSIDE_MTU));
We could consider using an array of BytesMut instead of static buffers, to avoid the copying at
https://github.com/expressvpn/lightway/pull/378/changes#diff-31acaecb4daa8d8ab8fbe35347045c8df09d58bf3520e98681d757ce94967a34R144
Later, when we need to take the buffer in the above location, we can replace it like:
std::mem::replace(&mut recv_bufs[i], BytesMut::with_capacity(MAX_OUTSIDE_MTU))
Prior art in wolfssl-rs:
https://github.com/expressvpn/wolfssl-rs/blob/427c6f668966f47a9fc3f66287f9231dd3a07faf/wolfssl/src/ssl.rs#L517-L524
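The take-instead-of-copy pattern the reviewer points at looks roughly like this, using `Vec<u8>` as a std-only stand-in for `BytesMut` and an illustrative MTU value:

```rust
use std::mem;

const BATCH_SIZE: usize = 4;
const MAX_OUTSIDE_MTU: usize = 1500; // illustrative value

fn main() {
    // Owned buffers, one per batch slot, each pre-allocated to MTU size.
    let mut recv_bufs: [Vec<u8>; BATCH_SIZE] =
        std::array::from_fn(|_| Vec::with_capacity(MAX_OUTSIDE_MTU));

    // Pretend slot 0 was filled by the batch receive.
    recv_bufs[0].extend_from_slice(&[0xAB; 42]);

    // Hand the filled buffer onward without copying its contents:
    // swap in a fresh, empty buffer of the same capacity.
    let taken = mem::replace(
        &mut recv_bufs[0],
        Vec::with_capacity(MAX_OUTSIDE_MTU),
    );
    assert_eq!(taken.len(), 42);
    assert!(recv_bufs[0].is_empty());
    assert!(recv_bufs[0].capacity() >= MAX_OUTSIDE_MTU);
}
```

Only the buffer's pointer/length move; the packet bytes themselves are never memcpy'd, which is the saving over copying out of a fixed `[[u8; MAX_OUTSIDE_MTU]; BATCH_SIZE]` array.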
return match receiver.pop_recv_consumer() {
    Ok(b) => {
        let len = b.len();
        buf.put_slice(&b[..]);
We could just take the recv buffer directly and drop the incoming buffer.
It would avoid copying the entire buffer. Suggested change: replace `buf.put_slice(&b[..]);` with `*buf = b;`.
Description
This PR brings a speed improvement by switching Apple clients to use `recvmsg_x` instead of the regular `recv_from` function. Since the new function allows us to receive a batch of packets all at once, we need something like a buffer to store the packets before `outside_io_task` polls/fetches a new packet. I have opted to use rtrb, a SPSC ring buffer with a fixed capacity that also supports pushing multiple items at once. It is wait-free and lock-free as well. Once we get packets from `recvmsg_x`, we push them to the ring buffer all at once via an iterator. On the other side, when `outside_io_task` runs `recv_buf` to fetch a packet, it pops from the ring buffer.

There are cases where the ring buffer will be fully loaded (especially when running a speed test on a 10Gbps line). If this happens, the `handle_udp_recv` tokio task will yield so that tasks like `outside_io_task` can actually start popping the ring buffer.

Missing in this PR, will do it separately:

- `recvmsg_x` does exist

Motivation and Context
To bring speed improvements on different Apple platforms
How Has This Been Tested?
Tested on a 10Gbps line connecting to one of the testing servers with the speedtest CLI & iPerf3 on an M2 Pro MacBook Pro:
On an iPhone 17 Pro Max with a 5Gbps line:
Types of changes
Checklist: