updates to prevent cpu stall when CONFIG_PREEMPT is set by thebied-red · Pull Request #48 · wget/realtek-r8152-linux

thebied-red · 2025-05-22T19:33:05Z

tested successfully with kernel versions 5.4.0, 5.10

#define RTL_MAX_SG_NUM 16 // 16 is insufficient in the cases of highly fragmented packets which leads to a CPU stall and subsequent system hang

Setting RTL_MAX_SG_NUM to 32 alleviates the issue with CONFIG_PREEEMPT set to y

The other changes in the request are to ensure work is actually being completed on tx and to defer to other tasks when CONFIG_PREEMPT is set

Hopefully this helps others facing similar issues.

trace from CPU stall with v2.19.2:
[ 67.745833] rcu: INFO: rcu_preempt self-detected stall on CPU

[ 67.745844] rcu: 3-....: (1 GPs behind) idle=576/1/0x4000000000000002 softirq=11193/11194 fqs=5250

[ 67.745848] (t=21000 jiffies g=25445 q=36795)

[ 67.745851] Task dump for CPU 3:

[ 67.745854] wss R running task 0 1072 1 0x00000002

[ 67.745860] Call trace:

[ 67.745870] dump_backtrace+0x0/0x140

[ 67.745874] show_stack+0x14/0x20

[ 67.745879] sched_show_task+0x108/0x138

[ 67.745883] dump_cpu_task+0x40/0x50

[ 67.745888] rcu_dump_cpu_stacks+0x94/0xd0

[ 67.745891] rcu_sched_clock_irq+0x7bc/0xa08

[ 67.745895] update_process_times+0x2c/0x68

[ 67.745900] tick_sched_handle.isra.0+0x30/0x50

[ 67.745904] tick_sched_timer+0x48/0x98

[ 67.745908] __hrtimer_run_queues+0x110/0x1b0

[ 67.745911] hrtimer_interrupt+0xe4/0x240

[ 67.745916] arch_timer_handler_phys+0x30/0x40

[ 67.745921] handle_percpu_devid_irq+0x80/0x140

[ 67.745925] generic_handle_irq+0x24/0x38

[ 67.745929] __handle_domain_irq+0x60/0xb8

[ 67.745932] gic_handle_irq+0x5c/0xb8

[ 67.745936] el1_irq+0xb8/0x180

[ 67.745941] _raw_spin_unlock_irqrestore+0xc/0x48

[ 67.745946] tasklet_action_common.isra.0+0xcc/0x178

[ 67.745950] tasklet_action+0x24/0x30

[ 67.745953] __do_softirq+0x118/0x22c

[ 67.745956] do_softirq.part.0+0x60/0x68

[ 67.745961] __local_bh_enable_ip+0x8c/0x98

[ 67.745965] ip_finish_output2+0x17c/0x580

[ 67.745969] __ip_finish_output+0xc8/0x1f0

[ 67.745972] ip_output+0xf4/0x1a0

[ 67.745975] ip_local_out+0x44/0x58

[ 67.745978] __ip_queue_xmit+0x124/0x390

[ 67.745983] ip_queue_xmit+0x10/0x18

[ 67.745986] __tcp_transmit_skb+0x49c/0xab0

[ 67.745990] tcp_write_xmit+0x230/0x1098

[ 67.745993] __tcp_push_pending_frames+0x38/0xc8

[ 67.745998] tcp_push+0x118/0x168

[ 67.746002] tcp_sendmsg_locked+0xa54/0xb50

[ 67.746006] tcp_sendmsg+0x34/0x58

[ 67.746011] inet_sendmsg+0x40/0x68

[ 67.746016] ___sys_sendmsg+0x280/0x2c0

[ 67.746019] __sys_sendmsg+0x64/0xb8

[ 67.746022] __arm64_sys_sendmsg+0x20/0x28

[ 67.746028] el0_svc_common.constprop.0+0x68/0x160

[ 67.746032] el0_svc_handler+0x6c/0x88

[ 67.746035] el0_svc+0x8/0xc

…kernel versions 5.4.0, 5.10

hcartiaux · 2025-05-25T18:15:05Z

Awesome, though I never had this issue, I'll have a look and see how-to re-apply on top of 2.20.1 (#48).

Note that there's 0 chance for your PR to be taken and upstreamed from this repository. If you really want your code to be merged/reused, you should try to get it included in linux itself: https://github.com/torvalds/linux/blob/master/drivers/net/usb/r8152.c

thebied-red · 2025-05-27T20:12:53Z

Appreciate the guidance. I'll try up in the linux repository. Hopefully this will help someone in the interim.

ristomatti · 2025-09-30T22:16:48Z

As a last effort, I decided to try if your changes would fix the issue with NFS reported by several people around different forums (and also here #50). I did a similar test cycle as I commented on @jerryabramson's issue report maybe ~7 times without a single warning shown on dmesg. If this was just not just pure luck, this PR will make many people happy (myself included). 🤞

Did this fix the issues you were having @jerryabramson? I can't say much based on my couple of hours of testing, but it sure seems promising.

Edit: Reference the actual issue reporter, not the repo maintaner.

jerryabramson · 2025-10-01T14:55:21Z

I studied your commit, and it looks like you ended up having to write quite a bit of code.

I know next to nothing about authoring or changing Linux drivers but this seems like a pretty big change.

I don’t know the history of this driver, but shouldn’t Realtek own this, given it’s their hardware and driver?

The bug only appeared with the new 5 GbE adapters and their driver changes.

ristomatti · 2025-10-01T19:48:44Z

@jerryabramson I've been running this patched driver since yesterday and have been able to use NFS so far with no issues at all. I recommend giving it a shot.

I would not count on Realtek fixing this. The adapter doesn't seem to be supported even on the newest 6.16 kernel. In general I've had bad experiences with Realtek's ethernet/WiFi chips on Linux. Previously I've been able to find some workaround to make them usable. This was the first time I was ready to give up, but now I'm hopeful this could be the workaround I was looking for.

I know only basic level C and absolutely nothing about driver or kernel level development, but it occurred to me to ask an LLM analysis of this patch (and to test the new Claude 4.5 Sonnet model). In my prompt I only mentioned I've experienced CPU lockups and other stability issues when using the Realtek driver for NFS traffic and asked if this patch could make a difference. I included GitHub's .patch URL to fetch: https://github.com/red-fw/realtek-r8152-linux/commit/b762c0f5eec1408fbc85bdc4d20a4defd9e2ca0e.patch

Like is often the case, the analysis looks plausible when you know nothing about the subject. I'll add add the response here in case @thebied-red surfaces and can tell if it got it all wrong. 🙂

Claude 4.5 Sonnet analysis

I'll fetch and analyze that patch for you.

Excellent! This patch is highly relevant to your NFS soft lock issue. Let me break down why this patch could very well solve your problem:

Key Issues the Patch Addresses

1. CPU Stall Prevention via `cond_resched()`

The commit message explicitly mentions: "updates to prevent cpu stall when CONFIG_PREEMPT is set"

The patch adds conditional rescheduling points in two critical TX aggregation loops:

r8152_tx_agg_fill()
r8152_tx_agg_sg_fill()

Both functions now track how many packets they process and call cond_resched() every 8 packets. This allows the kernel scheduler to regain control and prevents the driver from monopolizing the CPU.

2. Increased Scatter-Gather Buffer Size

-#define RTL_MAX_SG_NUM		16
+#define RTL_MAX_SG_NUM		32

The comment is revealing: "Ensure RTL_MAX_SG_NUM is large enough to enqueue fragmented packet to prevent CPU stall"

This is crucial for NFS! NFS uses heavily fragmented packets, especially with larger rsize/wsize settings. With only 16 scatter-gather segments, fragmented NFS packets might not fit in a single aggregation, potentially causing the driver to spin indefinitely trying to process them.

3. TX Loop Budgeting

The tx_bottom() function now has:

An iteration limit (max_iterations_per_tasklet_run = 8)
Proper handling of unprocessable SKBs
Automatic rescheduling when the budget is exceeded

This prevents the driver from getting stuck in an infinite loop processing packets, which is a common cause of soft lockups.

Why This Matters for NFS

NFS typically:

Uses large packet sizes (8KB+ read/write sizes)
Creates highly fragmented network packets
Generates sustained, heavy network traffic
Can trigger edge cases in network drivers that don't handle fragmentation well

The original driver code would process packets in tight loops without yielding, and couldn't handle large fragmented packets properly. Under NFS load, this would cause:

CPU soft lockups (exactly what you're seeing)
System freezes
RCU stall warnings

Recommendation

Yes, apply this patch. The evidence strongly suggests it will help:

✅ Explicitly targets CPU stall issues
✅ Addresses fragmented packet handling (NFS's bread and butter)
✅ Adds proper rescheduling points
✅ Implements loop budgeting to prevent infinite loops

The patch appears well-thought-out with extensive comments and proper error handling. The author tested it on kernels 5.4.0 and 5.10, and while your 6.14.x kernel is newer, these scheduler-related changes should work fine.

Note: After applying, monitor for the soft lockup messages in dmesg. They should disappear under NFS load.

- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48

aagit · 2025-10-31T16:32:41Z

Hello,

Kudos for finding a solution. I also run into a softlockup trying this driver on a 5gpbs usb ethernet.

So I noticed this patch alters RTL_MAX_SG_NUM, but also alters the r8152_tx_agg_fill path that never runs if RTL_MAX_SG_NUM is meaningful.

Seeing this patch my guess would be disabling sg or setting RTL_MAX_SG_NUM to 32 might be enough to fix it.

sg-use should be possible to turn off with ethtool -k/K or something without tweaking the driver.

If you've the time to test changing only SG_NUM to 32 or only disabling sg (aka sg-use) with ethool and no change to the code, it'd be interesting to know the results. I may try again, when I find the time.

I was going to return the device with no time to fix the driver, but I changed my mind seeing this one thread :).

My view is if the change to SG_NUM is not well understood on why exactly it fixes the issue, if the non-sg path works with no change, given it resembles the upstream one, disabling sg may be the safer choice for the time being.

aagit · 2025-10-31T16:33:31Z

Also a side note: if this only breaks on preempt on, adding cond_resched with preempt on, is a noop.

ristomatti · 2025-11-02T05:57:19Z

Just reporting back after a month of testing. I've been using a driver patched with these changes on two hosts: one with kernel 6.14.x and one with 5.15.x. I haven't (at least) noticed any issues. NFS works, iperf3test gives ~4.8Gbit/s between these two hosts.

aagit · 2025-11-02T16:04:23Z

It seems changing RTL_MAX_SG_NUM 16 to 32 is sufficient workaround for the hang for me too. So probably ethtool -K enp sg off would also work around it. The full changes I've been running are below, but the second change didn't set sg off, it was still on by default.

With sg off it can't reach 5gbps more like 4gbps, but overall it might be safer.

diff '--color=auto' -ur realtek-r8152-unchanged-2.20.1.orig/r8152.c realtek-r8152-unchanged-2.20.1/r8152.c
--- realtek-r8152-unchanged-2.20.1.orig/r8152.c	2025-10-31 16:51:27.780642941 +0100
+++ realtek-r8152-unchanged-2.20.1/r8152.c	2025-10-31 23:45:02.618358065 +0100
@@ -827,7 +827,7 @@
 #define mtu_to_size(m)		((m) + VLAN_ETH_HLEN + ETH_FCS_LEN)
 #define size_to_mtu(s)		((s) - VLAN_ETH_HLEN - ETH_FCS_LEN)
 
-#define RTL_MAX_SG_NUM		16
+#define RTL_MAX_SG_NUM		32
 
 /* rtl8152 flags */
 enum rtl8152_flags {
@@ -29396,7 +29396,8 @@
 	tasklet_setup(&tp->tx_tl, bottom_half);
 	tasklet_disable(&tp->tx_tl);
 
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,2,3)
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,2,3) && 0
+#error "sg_use might be unstable"
 	if (usb_device_no_sg_constraint(udev))
 		tp->sg_use = true;
 #endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(5,2,3) */

The hang must be happening because r8152_tx_agg_sg_fill returns 0 endlessly despite skb_queue_empty isn't true.

One theory could be num_sgs for the skb at the head of the list is bigger than RTL_MAX_SG_NUM so then it loops forever. There should probably be a check in that paths that if sg_num == 0 and the below check fails, it takes the non SG path instead next time around. That's probably safer than depending on the 32 value.

		if ((num_sgs + sg_num) > max_sg_num) {

When processing transmit packets in tx_bottom, the code was calling r8152_tx_agg_sg_fill for the same skb even when the SG fill failed to consume the skb. This could happen if the condition "1 + skb_shinfo(skb)->nr_frags > RTL_MAX_SG_NUM" was true, leading to infinite loops and softlockups. The softlockup here was reproducible only when accessing huggingface.co with Firefox. With this patch, the warning is logged as expected and no softlockup occurs anymore: [212844.954702] r8152 4-1:1.0 enp5s0f4u1: num_sgs 18 > max_sg_num 16 Increasing RTL_MAX_SG_NUM to avoid the failure appears to work too but it looks more fragile. Furthermore RTL_MAX_SG_NUM is used throughout the driver to communicate with the network layer and the driver is meant to be able to consume that number of network buffers at once, which may not be the case if increasing the value. This change prevents the infinite loop by invoking the non SG fill to consume the skb if the SG fill cannot consume it. Link: wget#48 Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

aagit · 2025-11-03T12:28:08Z

#53

ristomatti · 2025-11-07T01:46:57Z

@aagit I'll try to find time to test if your patch works for me also. That means earliest next week. Not that it will likely make any difference since this PR has been open >5 months now.

hcartiaux · 2025-11-07T08:41:24Z

@aagit I'll try to find time to test if your patch works for me also. That means earliest next week. Not that it will likely make any difference since this PR has been open >5 months now.

@ristomatti No, it's just that I don't have the hardware to reproduce and test, and also, I have to take the time to understand because these patches will have to be maintained in the future and ported to module newer versions. But I'm still active.

ristomatti · 2025-11-22T14:51:11Z

@hcartiaux I understand that. Does this effectively mean this repository might never support models you don't have available for testing, no matter how many people would report a patch working?

hcartiaux · 2025-11-25T23:52:28Z

@hcartiaux I understand that. Does this effectively mean this repository might never support models you don't have available for testing, no matter how many people would report a patch working?

It was never intended for that, initially it was just an import of the realtek sources with minimal patches to make it compile on an up-to-date archlinux system. We needed a public source url for the r8152-dkms package on AUR.

Now, I'm not even the owner of the project, but if the goal changes, I can't do it alone and we would have to reorganize this repository

updates to prevent cpu stall when CONFIG_PREEMPT is set, tested with …

b762c0f

…kernel versions 5.4.0, 5.10

hcartiaux mentioned this pull request Jul 9, 2025

Realtek 5 GbE USB Network Dongles (WisdPi, but generic issue with 8157) cause NFS hangs and kernel crashes #50

Open

ristomatti added a commit to ristomatti/realtek-r8152-dkms that referenced this pull request Oct 16, 2025

Update Debian package version to 2.19.2-2

649abe9

- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48

ristomatti added a commit to ristomatti/realtek-r8152-dkms that referenced this pull request Oct 16, 2025

Update Debian package version to 2.20.1-2

c34cfd4

- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48

ristomatti added a commit to ristomatti/realtek-r8152-dkms that referenced this pull request Oct 16, 2025

Update Debian package version to 2.19.2-2

42e7e8d

- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48

aagit mentioned this pull request Nov 3, 2025

Fix softlockup in tx_bottom by preventing repeated failed SG fills #53

Open

Conversation

thebied-red commented May 22, 2025

Uh oh!

hcartiaux commented May 25, 2025

Uh oh!

thebied-red commented May 27, 2025

Uh oh!

ristomatti commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jerryabramson commented Oct 1, 2025

Uh oh!

ristomatti commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Issues the Patch Addresses

1. CPU Stall Prevention via cond_resched()

2. Increased Scatter-Gather Buffer Size

3. TX Loop Budgeting

Why This Matters for NFS

Recommendation

Uh oh!

aagit commented Oct 31, 2025

Uh oh!

aagit commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ristomatti commented Nov 2, 2025

Uh oh!

aagit commented Nov 2, 2025

Uh oh!

aagit commented Nov 3, 2025

Uh oh!

ristomatti commented Nov 7, 2025

Uh oh!

hcartiaux commented Nov 7, 2025

Uh oh!

ristomatti commented Nov 22, 2025

Uh oh!

hcartiaux commented Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ristomatti commented Sep 30, 2025 •

edited

Loading

ristomatti commented Oct 1, 2025 •

edited

Loading

1. CPU Stall Prevention via `cond_resched()`

aagit commented Oct 31, 2025 •

edited

Loading

hcartiaux commented Nov 25, 2025 •

edited

Loading