workqueue_lockup: scan workqueue lockup issues #207
Orabug: 39022047
Signed-off-by: Richard Li <tianqi.li@oracle.com>
    class WorkQueueLockup(CorelensModule):
        """
        Detect workqueue lockup issues
        """

        name = "workqueue_lockup"

        def run(self, prog: Program, args: argparse.Namespace) -> None:
            scan_workqueue_lockup(prog)
I'm trying to make sense of this in terms of which things it catches that our other corelens modules do not already address. We have the workqueue module, which prints all workqueues. I can see how, even if a locked-up workqueue appeared in that module's output, it may not be helpful if it's buried among all the system's workqueues. (Even LLMs may be overwhelmed by the extra output.)
We have the lock module, which would only show a workqueue if it was stuck holding a supported lock type. Thus it would miss workqueues that are stuck on things other than mutexes/semaphores.
We have the lockup module, which would show a CPU if it was stuck running the same task for too long. Thus it would miss workqueues that are stuck but off-CPU.
So the thing that this could catch, which isn't caught by any other module, is workqueue tasks that have been off-CPU for a long time. Is that the motivation here? I ask because I'm wary of adding another lockup/hang detection module rather than improving what we have, and also because the module uses task_lastrun2now(), which seems to imply an on-CPU task.
Hi @brenns10, the motivation behind this is really about detecting workqueue lockup issues and dumping them in a straightforward, LLM-friendly way. As you mentioned, we already have relevant pieces here and there, and we have the option of integrating this into an existing module (the lockup module may be a good candidate, with off-CPU tasks dumped as well). I am not opinionated about the overall architecture :)
Scan for workqueue lockups based on wq_watchdog_thresh, which defaults to 30 seconds. Dumps the affected workqueues along with the current task on the corresponding CPU and the current worker task, with call trace.
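To illustrate the threshold check described above, here is a minimal sketch in plain Python. It is not the code from this PR and does not use drgn. `PoolSnapshot`, `stalled_pools`, and the `HZ` value are hypothetical names invented for the example. It assumes each worker pool records a jiffies timestamp of its last forward progress, and it ignores jiffies wraparound.

```python
from dataclasses import dataclass
from typing import List, Tuple

HZ = 1000  # assumed tick rate; a real tool would read this from the kernel


@dataclass
class PoolSnapshot:
    """Hypothetical, simplified view of one worker pool in a vmcore."""
    cpu: int
    pending: int       # number of work items still queued on the pool
    watchdog_ts: int   # jiffies value when the pool last made progress


def stalled_pools(
    pools: List[PoolSnapshot],
    now_jiffies: int,
    thresh_secs: int = 30,  # mirrors the 30-second default mentioned above
    hz: int = HZ,
) -> List[Tuple[PoolSnapshot, float]]:
    """Return (pool, stalled_seconds) for pools stalled past the threshold."""
    limit = thresh_secs * hz
    result = []
    for pool in pools:
        # An idle pool has nothing to make progress on, so it is not stuck.
        if pool.pending == 0:
            continue
        stalled = now_jiffies - pool.watchdog_ts
        if stalled > limit:
            result.append((pool, stalled / hz))
    return result


pools = [
    PoolSnapshot(cpu=0, pending=2, watchdog_ts=1_000),   # stalled for 49 s
    PoolSnapshot(cpu=1, pending=0, watchdog_ts=1_000),   # idle, skipped
    PoolSnapshot(cpu=2, pending=1, watchdog_ts=40_000),  # stalled for 10 s
]
for pool, secs in stalled_pools(pools, now_jiffies=50_000):
    print(f"CPU {pool.cpu}: pool stalled for {secs:.0f}s, {pool.pending} pending")
```

In a real module, the per-pool fields would come from the vmcore, and each flagged pool would then be the starting point for dumping the CPU's current task and the worker's call trace.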