Skip to content

1210: Fix coredump in http_client during rapid connection destruction#1476

Open
baemyung wants to merge 2 commits into
ibm-openbmc:1210from
baemyung:1210-fix-http-client-coredump-by-rapid-conn-close
Open

1210: Fix coredump in http_client during rapid connection destruction#1476
baemyung wants to merge 2 commits into
ibm-openbmc:1210from
baemyung:1210-fix-http-client-coredump-by-rapid-conn-close

Conversation

@baemyung
Copy link
Copy Markdown
Contributor

Issue:
BMCWeb crashes with segmentation fault when HTTP client connections are rapidly created and destroyed while async operations are pending. This occurs during event subscription deletion, BMCWeb shutdown, or network timeout scenarios where ConnectionPool is destroyed while callbacks are still executing.

Root Cause:
Race condition between async callback execution and ConnectionPool destruction leads to three failure modes:

  1. Out-of-bounds vector access in sendNext() when connId is invalid
  2. Null pointer dereference when connection pointer is null
  3. Use-after-free when callback executes after pool destruction

The issue manifests as:

  • Segmentation fault in sendNext()
  • Crash in afterSendData() when accessing destroyed pool
  • Invalid memory access during callback execution

Key Fixes:

  1. Bounds Checking

    • Validate connId < connections.size() before vector access
    • Prevents segmentation fault from out-of-bounds access
    • Returns early with error log if connId invalid
  2. Null Pointer Validation

    • Check if connection pointer is null before dereferencing
    • Provides graceful error handling instead of crash
    • Logs error and returns safely
  3. Reordered afterSendData()

    • Check ConnectionPool validity BEFORE calling user callback
    • Prevents callback execution if pool has been destroyed
    • Ensures safe early return if pool no longer exists

Testing:
Reproduced coredump by simulating aggressive client behavior:

  • Client rapidly creates HTTP connections to BMCWeb
  • Client sends POST requests to create event subscriptions
  • Client abruptly closes connections during SSL handshake
  • Client destroys sessions while async operations are pending
  • This triggers race condition where callbacks execute after ConnectionPool destruction

Example Python code that reproduces the issue:

  for i in range(100):
      session = requests.Session()
      session.post('https://bmc/redfish/v1/EventService/Subscriptions',
                   json={'Destination': 'http://192.0.2.1:8080/events'})
      time.sleep(0.01)  # Brief delay
      session.close()   # Abrupt destruction while request pending
      del session

Before fix:

  • Segmentation fault in sendNext()
  • BMCWeb process crashes
  • Coredump generated

After fix:

  • No coredumps detected
  • BMCWeb process remains stable
  • Graceful error handling with log messages: "Invalid connId: X (size: Y)" "Connection at index X is null" "Failed to capture connection"

Expected logs during aggressive client testing (normal behavior):

  • SSL handshake failures (client abruptly disconnects)
  • Authentication failures (connections close during auth)
  • No segmentation faults or process crashes

Verification:
coredumpctl list bmcweb # No new coredumps
systemctl status bmcweb # Process still running
dmesg | grep segfault # No segmentation faults

@baemyung
Copy link
Copy Markdown
Contributor Author

baemyung commented May 21, 2026

@baemyung baemyung marked this pull request as draft May 21, 2026 12:46
@baemyung baemyung marked this pull request as ready for review May 22, 2026 00:01
@baemyung baemyung force-pushed the 1210-fix-http-client-coredump-by-rapid-conn-close branch from e922687 to 008fff0 Compare May 26, 2026 17:48
Copy link
Copy Markdown
Contributor

@jnin-dev jnin-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything wrong, looks good to me

baemyung added 2 commits May 29, 2026 08:08
Issue:
BMCWeb crashes with segmentation fault when HTTP client connections are
rapidly created and destroyed while async operations are pending. This
occurs during event subscription deletion, BMCWeb shutdown, or network
timeout scenarios where ConnectionPool is destroyed while callbacks are
still executing.

Root Cause:
Race condition between async callback execution and ConnectionPool
destruction leads to three failure modes:
1. Out-of-bounds vector access in sendNext() when connId is invalid
2. Null pointer dereference when connection pointer is null
3. Use-after-free when callback executes after pool destruction

The issue manifests as:
- Segmentation fault in sendNext()
- Crash in afterSendData() when accessing destroyed pool
- Invalid memory access during callback execution

Key Fixes:
1. Bounds Checking
   - Validate connId < connections.size() before vector access
   - Prevents segmentation fault from out-of-bounds access
   - Returns early with error log if connId invalid

2. Null Pointer Validation
   - Check if connection pointer is null before dereferencing
   - Provides graceful error handling instead of crash
   - Logs error and returns safely

3. Reordered afterSendData()
   - Check ConnectionPool validity BEFORE calling user callback
   - Prevents callback execution if pool has been destroyed
   - Ensures safe early return if pool no longer exists

Testing:
Reproduced coredump by simulating aggressive client behavior:
- Client rapidly creates HTTP connections to BMCWeb
- Client sends POST requests to create event subscriptions
- Client abruptly closes connections during SSL handshake
- Client destroys sessions while async operations are pending
- This triggers race condition where callbacks execute after
  ConnectionPool destruction

Example Python code that reproduces the issue:
```
  for i in range(100):
      session = requests.Session()
      session.post('https://bmc/redfish/v1/EventService/Subscriptions',
                   json={'Destination': 'http://192.0.2.1:8080/events'})
      time.sleep(0.01)  # Brief delay
      session.close()   # Abrupt destruction while request pending
      del session
```

Before fix:
- Segmentation fault in sendNext()
- BMCWeb process crashes
- Coredump generated

After fix:
- No coredumps detected
- BMCWeb process remains stable
- Graceful error handling with log messages:
  "Invalid connId: X (size: Y)"
  "Connection at index X is null"
  "Failed to capture connection"

Expected logs during aggressive client testing (normal behavior):
- SSL handshake failures (client abruptly disconnects)
- Authentication failures (connections close during auth)
- No segmentation faults or process crashes

Verification:
  coredumpctl list bmcweb        # No new coredumps
  systemctl status bmcweb        # Process still running
  dmesg | grep segfault          # No segmentation faults

Change-Id: I7b7a61aa8fcce1e32b5e3893da5ff0d5b3e3344f
Signed-off-by: Myung Bae <myungbae@us.ibm.com>
Some http_client async callbacks in ConnectionInfo used raw `this`
pointers in std::bind_front(), causing use-after-free when the object
was destroyed before callbacks completed.

The crash occurs at:

```
 ibm-openbmc#8  std::__throw_bad_function_call()
 ibm-openbmc#9  crow::ConnectionInfo::afterRead (this=0x14dc1a0, ec=..., bytesTransferred=<optimized out>)
     at /usr/src/debug/bmcweb/1.0+git/http/http_client.hpp:390
```

The matching code is
```
    void afterRead(const std::shared_ptr<ConnectionInfo>& /*self*/,
                   const boost::beast::error_code& ec,
                   const std::size_t bytesTransferred)
    {
       ...
        // Make sure the received response code is valid as defined by
        // the associated retry policy
        if (connPolicy->invalidResp(respCode))  <---
        {
```

In summary,
```
if (connPolicy->invalidResp(respCode))
The std::bad_function_call exception implies that the function is either
invalid, or its anchored object is invalid.
In this case, it seems the object (e.g. this or connPolicy) is invalid.

The code uses:
```
std::bind_front(&ConnectionInfo::afterResolve, this, shared_from_this())
```

While shared_from_this() is passed as a parameter, std::bind_front binds
this (raw pointer) as the implicit first parameter for the member
function call (e.g. afterResolve).

By making sure `this (self)` comes from as `shared_from_this()`,
`this`will be valid as long as `shared_from_this()` is valid.
``
    [self = shared_from_this()](const boost::beast::error_code& ec,
                                            size_t bytesTransferred) {
         self->afterRead(self, ec, bytesTransferred);
    }
```

Tested:
- Redfish Service Validator passes
- It is not easy to reproduce the issue manually, but I made sure
  that it is working under stress run to invoke many of the client
  connection and disconnection (as much as possible).

Change-Id: I9a840f4f99521038a40064c889710c35e6be2480
Signed-off-by: Myung Bae <myungbae@us.ibm.com>
@baemyung baemyung force-pushed the 1210-fix-http-client-coredump-by-rapid-conn-close branch from 008fff0 to c049e43 Compare May 29, 2026 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants