workqueue_lockup: scan workqueue lockup issues #207

Open
richl9 wants to merge 1 commit into oracle-samples:main from richl9:richard/workqueue-lockup

richl9 (Contributor) commented Feb 27, 2026

Scan for workqueue lockups based on wq_watchdog_thresh, which defaults to 30 seconds. For each locked-up workqueue, dump the current task on the corresponding CPU and the current worker task, along with its call trace.

Orabug: 39022047
Signed-off-by: Richard Li <tianqi.li@oracle.com>
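
The threshold check described above can be sketched in plain Python. This is only an illustration under stated assumptions: the diff excerpt in this conversation does not show `scan_workqueue_lockup()`, so `PoolSnapshot`, `stalled_pools()`, the `HZ` value, and the field names below are all hypothetical stand-ins rather than the PR's actual code. The sketch also omits details a real check would need, such as jiffies wraparound.

```python
from dataclasses import dataclass

HZ = 1000  # assumed tick rate, for illustration only
DEFAULT_WQ_WATCHDOG_THRESH = 30  # seconds, per the description above


@dataclass
class PoolSnapshot:
    """Hypothetical, simplified view of one worker pool read from a vmcore."""
    pool_id: int
    has_pending_work: bool
    watchdog_ts: int  # jiffies when the pool last made forward progress


def stalled_pools(pools, now_jiffies,
                  thresh_secs=DEFAULT_WQ_WATCHDOG_THRESH, hz=HZ):
    """Return (pool_id, stalled_seconds) for pools stalled past the threshold."""
    thresh = thresh_secs * hz
    result = []
    for pool in pools:
        if not pool.has_pending_work:
            continue  # a pool with nothing queued is idle, not locked up
        stalled = now_jiffies - pool.watchdog_ts
        if stalled > thresh:
            result.append((pool.pool_id, stalled // hz))
    return result
```

With three made-up pools, `stalled_pools([PoolSnapshot(0, True, 0), PoolSnapshot(1, True, 25_000), PoolSnapshot(2, False, 0)], now_jiffies=40_000)` flags only pool 0: pool 1 has been stalled for 15 seconds, which is under the threshold, and pool 2 has no pending work.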
oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) Feb 27, 2026
Comment on lines +122 to +130
class WorkQueueLockup(CorelensModule):
    """
    Detect workqueue lockup issues
    """

    name = "workqueue_lockup"

    def run(self, prog: Program, args: argparse.Namespace) -> None:
        scan_workqueue_lockup(prog)
Member

I'm trying to understand which things this catches that our other corelens modules do not already address. We have the workqueue module, which prints all workqueues. I can see how, even if a locked-up workqueue appeared in that module's output, it may not be helpful if it's buried among all the system's workqueues. (Even LLMs may be overwhelmed by the extra output.)

We have the lock module, which would only show a workqueue if it was stuck holding a supported lock type. Thus it would miss workqueues that are stuck on things other than mutexes/semaphores.

We have the lockup module, which would show a CPU if it was stuck running the same task for too long. Thus it would miss workqueues that are stuck but off-CPU.

So the thing that this could catch, which isn't caught by any other module, is workqueue tasks which are off-CPU for a long time. Is that the motivation here? I ask because I'm wary of adding another lockup/hang detection module rather than improving what we have, and also because the module uses task_lastrun2now(), which seems to imply an on-CPU task.

Contributor Author

Hi @brenns10, the motivation here is to detect workqueue lockup issues and dump them in a straightforward, LLM-friendly way. As you mentioned, we already have the relevant pieces in several modules, so one option is to integrate this into an existing module (the lockup module may be a good candidate, extended to dump off-CPU tasks). I am not opinionated about the overall architecture :)
