Skip to content

Add deadlock detection plugin for XRootD OSS and authorization operations#125

Draft
Copilot wants to merge 5 commits into
mainfrom
copilot/create-deadlock-detection-plugin
Draft

Add deadlock detection plugin for XRootD OSS and authorization operations#125
Copilot wants to merge 5 commits into
mainfrom
copilot/create-deadlock-detection-plugin

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Dec 10, 2025

Deadlock Detection Plugin Implementation Plan

  • Create core deadlock detection class (DeadlockDetector.hh/cc)

    • Implement RAII object with doubly linked list management
    • Create 15 cache-aligned lists for Linux (cpuid % 15)
    • Create single list for non-Linux platforms
    • Implement background monitoring thread
    • Add configurable timeout via XRootD config
    • Add optional atomic log file writing for events
    • Implement module load/unload with thread lifecycle
    • Join monitor thread on shutdown (no arbitrary sleep)
    • Optimize list access with local references
  • Create OSS wrapper plugin (DeadlockOss.hh/cc)

    • Wrap all XrdOss operations with deadlock detection
    • Skip monitoring for newDir/newFile (not blocking operations)
    • Forward calls to wrapped OSS instance
    • Apply RAII pattern to detect blocking calls
  • Create authorization wrapper plugin (DeadlockAcc.hh/cc)

    • Wrap XrdAccAuthorize operations
    • Implement proper plugin stacking with dlopen
    • Forward calls to wrapped authorization instance
    • Apply RAII pattern to detect blocking calls
  • Create unit tests (deadlock_tests.cc)

    • Multi-threaded rapid create/destroy test
    • Death test for SIGKILL trigger
    • Positive test (no kill when below threshold)
    • Fixed test that expected uninitialized monitor
  • Create integration tests

    • Setup/teardown/test scripts for OSS wrapper
    • Dummy modules that stall to trigger SIGKILL
    • Test confirms process is killed on timeout
  • Update CMakeLists.txt

    • Add new libraries and test targets
    • Configure proper linking and dependencies
    • Add dummy stall OSS for testing
  • Update README.md

    • Document configuration options
    • Provide usage examples
    • Explain timeout and log file settings
    • Remove "How It Works" subsection
    • Update authorization wrapper documentation
Original prompt

This section details on the original issue you should resolve

<issue_title>Create deadlock detection plugin</issue_title>
<issue_description>Create a new class that performs deadlock detection.

A deadlock is defined to be when a blocking call doesn't return after a preset max time (should be in the configurable via xrootd config).

The goal is to create a RAII object that adds itself to a doubly linked list on creation and removes itself on deletion. The linked lists should be created on module load. They should be 15 statically allocated lists; when the object is created, it should get the cpuid of the current thread modulo 15 and select the resulting list for its lifetime. On non-Linux, have a single list and always choose it.

Adding and removing objects from a list should require locking a mutex associated with the list. The array containing the mutex and list head element should be cache line aligned to avoid false sharing.

On module load, a background thread should be created to periodically walk through each list and see if any meet the deadlock criteria (time elapsed since the object was created is over threshold). If it is, then an error should be logged a the process killed with SIGKILL. If configured to do so, atomically write the time of the event to a separate log file that can be monitored.

On module unload, a condition variable should be notified that causes the thread to exit (join the background thread from the main one).

Once the deadlock detection object is designed, create an OSS and xrootd authorization plugin (see github.com/xrootd/xrootd) wrapper that creates the object on the stack and then forwards the call to the wrapped object.

Create unit tests. At least one should launch multiple threads each of which rapidly create and destroy deadlock detection objects. One test should be a "death test" that triggers the SIGKILL.

Create an integration test that wraps both the auth module and OSS. Wrap around dummy modules that simply stall out, triggering the SIGKILL. Ensure that the xrootd process is killed. Have a positive test as well, making sure nothing happens if the runtime is below threshold.

Update the README to indicate how to use the new deadlock detection.</issue_description>

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 2 commits December 10, 2025 02:42
Co-authored-by: bbockelm <1093447+bbockelm@users.noreply.github.com>
Co-authored-by: bbockelm <1093447+bbockelm@users.noreply.github.com>
Copilot AI changed the title [WIP] Add deadlock detection plugin implementation Add deadlock detection plugin for XRootD OSS and authorization operations Dec 10, 2025
Copilot AI requested a review from bbockelm December 10, 2025 02:47
Comment thread src/DeadlockDetector.cc
Copy link
Copy Markdown
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot - please address review requests and inspect / fix unit test failures.

Comment thread src/DeadlockOss.cc

DeadlockOss::~DeadlockOss() {}

XrdOssDF *DeadlockOss::newDir(const char *user) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can skip deadlock detection for newDir and newFile.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed deadlock monitoring from newDir and newFile in commit 79d693e.

Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockDetector.cc Outdated
m_shutdown_cv.notify_one();

// Give the thread a moment to exit
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join the thread instead of having an arbitrary sleep.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed Shutdown() to join the monitor thread instead of sleeping. Added static m_monitor_thread member. See commit 79d693e.

Comment thread src/DeadlockDetector.cc Outdated

DeadlockMonitor::DeadlockMonitor(const char *operation) {
// Don't create monitors if not initialized
if (!DeadlockDetector::m_initialized.load(std::memory_order_acquire)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume if the plugin is loaded then it is initialized. Skip this check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed initialization check from DeadlockMonitor constructor - now assumes plugin is initialized when loaded. See commit 79d693e.

Comment thread src/DeadlockDetector.cc Outdated

// Add to the list
auto &detector = DeadlockDetector::GetInstance();
std::unique_lock<std::mutex> lock(detector.m_lists[m_list_id].m_mutex);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get a reference to the list head and use it for the remainder of the function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated DeadlockMonitor constructor and destructor to get a reference to list_head and use it throughout. See commit 79d693e.

Comment thread README.md Outdated
deadlock.logfile /var/log/xrootd/deadlocks.log
```

#### How It Works
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This subsection can be skipped

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "How It Works" subsection from README. See commit 79d693e.

Copilot AI and others added 2 commits December 10, 2025 03:09
…d on shutdown, optimize list access, implement auth stacking

Co-authored-by: bbockelm <1093447+bbockelm@users.noreply.github.com>
Co-authored-by: bbockelm <1093447+bbockelm@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Dec 10, 2025

@copilot - please address review requests and inspect / fix unit test failures.

Addressed all review feedback and fixed unit test issue:

  1. Removed deadlock monitoring from newDir/newFile (commit 79d693e)
  2. Implemented proper authorization plugin stacking with dlopen (commit 79d693e)
  3. Changed Shutdown() to join the monitor thread (commit 79d693e)
  4. Removed initialization check in DeadlockMonitor constructor (commit 79d693e)
  5. Optimized list access with local references (commit 79d693e)
  6. Removed "How It Works" subsection from README (commit 79d693e)
  7. Fixed unit test that expected uninitialized monitor to work (commit 4a2a76d)

All changes are ready for review.

Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
Comment thread src/DeadlockAcc.cc
dlclose(lib_handle);
return nullptr;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[lint] reported by reviewdog 🐶

Suggested change

Comment thread src/DeadlockOss.cc
Comment on lines +66 to +67
int DeadlockOss::Reloc(const char *tident, const char *path,
const char *cgName, const char *anchor) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[lint] reported by reviewdog 🐶

Suggested change
int DeadlockOss::Reloc(const char *tident, const char *path,
const char *cgName, const char *anchor) {
int DeadlockOss::Reloc(const char *tident, const char *path, const char *cgName,
const char *anchor) {

Comment thread src/DeadlockOss.cc
Comment on lines +77 to +78
int DeadlockOss::Rename(const char *oPath, const char *nPath,
XrdOucEnv *oEnvP, XrdOucEnv *nEnvP) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[lint] reported by reviewdog 🐶

Suggested change
int DeadlockOss::Rename(const char *oPath, const char *nPath,
XrdOucEnv *oEnvP, XrdOucEnv *nEnvP) {
int DeadlockOss::Rename(const char *oPath, const char *nPath, XrdOucEnv *oEnvP,
XrdOucEnv *nEnvP) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create deadlock detection plugin

2 participants