updates to prevent cpu stall when CONFIG_PREEMPT is set#48
Conversation
…kernel versions 5.4.0, 5.10
|
Awesome, though I never had this issue, I'll have a look and see how-to re-apply on top of 2.20.1 (#48). Note that there's 0 chance for your PR to be taken and upstreamed from this repository. If you really want your code to be merged/reused, you should try to get it included in linux itself: https://github.com/torvalds/linux/blob/master/drivers/net/usb/r8152.c |
|
Appreciate the guidance. I'll try up in the linux repository. Hopefully this will help someone in the interim. |
|
As a last effort, I decided to try if your changes would fix the issue with NFS reported by several people around different forums (and also here #50). I did a similar test cycle as I commented on @jerryabramson's issue report maybe ~7 times without a single warning shown on Did this fix the issues you were having @jerryabramson? I can't say much based on my couple of hours of testing, but it sure seems promising. Edit: Reference the actual issue reporter, not the repo maintaner. |
|
I studied your commit, and it looks like you ended up having to write quite a bit of code. I know next to nothing about authoring or changing Linux drivers but this seems like a pretty big change. I don’t know the history of this driver, but shouldn’t Realtek own this, given it’s their hardware and driver? The bug only appeared with the new 5 GbE adapters and their driver changes. |
|
@jerryabramson I've been running this patched driver since yesterday and have been able to use NFS so far with no issues at all. I recommend giving it a shot. I would not count on Realtek fixing this. The adapter doesn't seem to be supported even on the newest 6.16 kernel. In general I've had bad experiences with Realtek's ethernet/WiFi chips on Linux. Previously I've been able to find some workaround to make them usable. This was the first time I was ready to give up, but now I'm hopeful this could be the workaround I was looking for. I know only basic level C and absolutely nothing about driver or kernel level development, but it occurred to me to ask an LLM analysis of this patch (and to test the new Claude 4.5 Sonnet model). In my prompt I only mentioned I've experienced CPU lockups and other stability issues when using the Realtek driver for NFS traffic and asked if this patch could make a difference. I included GitHub's Like is often the case, the analysis looks plausible when you know nothing about the subject. I'll add add the response here in case @thebied-red surfaces and can tell if it got it all wrong. 🙂 Claude 4.5 Sonnet analysisI'll fetch and analyze that patch for you. Excellent! This patch is highly relevant to your NFS soft lock issue. Let me break down why this patch could very well solve your problem: Key Issues the Patch Addresses1. CPU Stall Prevention via
|
- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48
- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48
- Apply patch by Matthew Biederman <thebied@red.com> wget/realtek-r8152-linux#48
|
Hello, Kudos for finding a solution. I also run into a softlockup trying this driver on a 5gpbs usb ethernet. So I noticed this patch alters RTL_MAX_SG_NUM, but also alters the r8152_tx_agg_fill path that never runs if RTL_MAX_SG_NUM is meaningful. Seeing this patch my guess would be disabling sg or setting RTL_MAX_SG_NUM to 32 might be enough to fix it. sg-use should be possible to turn off with ethtool -k/K or something without tweaking the driver. If you've the time to test changing only SG_NUM to 32 or only disabling sg (aka sg-use) with ethool and no change to the code, it'd be interesting to know the results. I may try again, when I find the time. I was going to return the device with no time to fix the driver, but I changed my mind seeing this one thread :). My view is if the change to SG_NUM is not well understood on why exactly it fixes the issue, if the non-sg path works with no change, given it resembles the upstream one, disabling sg may be the safer choice for the time being. |
|
Also a side note: if this only breaks on preempt on, adding cond_resched with preempt on, is a noop. |
|
Just reporting back after a month of testing. I've been using a driver patched with these changes on two hosts: one with kernel 6.14.x and one with 5.15.x. I haven't (at least) noticed any issues. NFS works, |
|
It seems changing RTL_MAX_SG_NUM 16 to 32 is sufficient workaround for the hang for me too. So probably With sg off it can't reach 5gbps more like 4gbps, but overall it might be safer. The hang must be happening because r8152_tx_agg_sg_fill returns 0 endlessly despite skb_queue_empty isn't true. One theory could be num_sgs for the skb at the head of the list is bigger than RTL_MAX_SG_NUM so then it loops forever. There should probably be a check in that paths that if sg_num == 0 and the below check fails, it takes the non SG path instead next time around. That's probably safer than depending on the 32 value. |
When processing transmit packets in tx_bottom, the code was calling r8152_tx_agg_sg_fill for the same skb even when the SG fill failed to consume the skb. This could happen if the condition "1 + skb_shinfo(skb)->nr_frags > RTL_MAX_SG_NUM" was true, leading to infinite loops and softlockups. The softlockup here was reproducible only when accessing huggingface.co with Firefox. With this patch, the warning is logged as expected and no softlockup occurs anymore: [212844.954702] r8152 4-1:1.0 enp5s0f4u1: num_sgs 18 > max_sg_num 16 Increasing RTL_MAX_SG_NUM to avoid the failure appears to work too but it looks more fragile. Furthermore RTL_MAX_SG_NUM is used throughout the driver to communicate with the network layer and the driver is meant to be able to consume that number of network buffers at once, which may not be the case if increasing the value. This change prevents the infinite loop by invoking the non SG fill to consume the skb if the SG fill cannot consume it. Link: wget#48 Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
@aagit I'll try to find time to test if your patch works for me also. That means earliest next week. Not that it will likely make any difference since this PR has been open >5 months now. |
@ristomatti No, it's just that I don't have the hardware to reproduce and test, and also, I have to take the time to understand because these patches will have to be maintained in the future and ported to module newer versions. But I'm still active. |
|
@hcartiaux I understand that. Does this effectively mean this repository might never support models you don't have available for testing, no matter how many people would report a patch working? |
It was never intended for that, initially it was just an import of the realtek sources with minimal patches to make it compile on an up-to-date archlinux system. We needed a public source url for the r8152-dkms package on AUR. Now, I'm not even the owner of the project, but if the goal changes, I can't do it alone and we would have to reorganize this repository |
tested successfully with kernel versions 5.4.0, 5.10
#define RTL_MAX_SG_NUM 16 // 16 is insufficient in the cases of highly fragmented packets which leads to a CPU stall and subsequent system hang
Setting RTL_MAX_SG_NUM to 32 alleviates the issue with CONFIG_PREEEMPT set to y
The other changes in the request are to ensure work is actually being completed on tx and to defer to other tasks when CONFIG_PREEMPT is set
Hopefully this helps others facing similar issues.
trace from CPU stall with v2.19.2:
[ 67.745833] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 67.745844] rcu: 3-....: (1 GPs behind) idle=576/1/0x4000000000000002 softirq=11193/11194 fqs=5250
[ 67.745848] (t=21000 jiffies g=25445 q=36795)
[ 67.745851] Task dump for CPU 3:
[ 67.745854] wss R running task 0 1072 1 0x00000002
[ 67.745860] Call trace:
[ 67.745870] dump_backtrace+0x0/0x140
[ 67.745874] show_stack+0x14/0x20
[ 67.745879] sched_show_task+0x108/0x138
[ 67.745883] dump_cpu_task+0x40/0x50
[ 67.745888] rcu_dump_cpu_stacks+0x94/0xd0
[ 67.745891] rcu_sched_clock_irq+0x7bc/0xa08
[ 67.745895] update_process_times+0x2c/0x68
[ 67.745900] tick_sched_handle.isra.0+0x30/0x50
[ 67.745904] tick_sched_timer+0x48/0x98
[ 67.745908] __hrtimer_run_queues+0x110/0x1b0
[ 67.745911] hrtimer_interrupt+0xe4/0x240
[ 67.745916] arch_timer_handler_phys+0x30/0x40
[ 67.745921] handle_percpu_devid_irq+0x80/0x140
[ 67.745925] generic_handle_irq+0x24/0x38
[ 67.745929] __handle_domain_irq+0x60/0xb8
[ 67.745932] gic_handle_irq+0x5c/0xb8
[ 67.745936] el1_irq+0xb8/0x180
[ 67.745941] _raw_spin_unlock_irqrestore+0xc/0x48
[ 67.745946] tasklet_action_common.isra.0+0xcc/0x178
[ 67.745950] tasklet_action+0x24/0x30
[ 67.745953] __do_softirq+0x118/0x22c
[ 67.745956] do_softirq.part.0+0x60/0x68
[ 67.745961] __local_bh_enable_ip+0x8c/0x98
[ 67.745965] ip_finish_output2+0x17c/0x580
[ 67.745969] __ip_finish_output+0xc8/0x1f0
[ 67.745972] ip_output+0xf4/0x1a0
[ 67.745975] ip_local_out+0x44/0x58
[ 67.745978] __ip_queue_xmit+0x124/0x390
[ 67.745983] ip_queue_xmit+0x10/0x18
[ 67.745986] __tcp_transmit_skb+0x49c/0xab0
[ 67.745990] tcp_write_xmit+0x230/0x1098
[ 67.745993] __tcp_push_pending_frames+0x38/0xc8
[ 67.745998] tcp_push+0x118/0x168
[ 67.746002] tcp_sendmsg_locked+0xa54/0xb50
[ 67.746006] tcp_sendmsg+0x34/0x58
[ 67.746011] inet_sendmsg+0x40/0x68
[ 67.746016] ___sys_sendmsg+0x280/0x2c0
[ 67.746019] __sys_sendmsg+0x64/0xb8
[ 67.746022] __arm64_sys_sendmsg+0x20/0x28
[ 67.746028] el0_svc_common.constprop.0+0x68/0x160
[ 67.746032] el0_svc_handler+0x6c/0x88
[ 67.746035] el0_svc+0x8/0xc