Fix segfaults caused by race condition between ntlm close handler and client conn cleanup #3
Draft
… conn cleanup

Agent-Logs-Url: https://github.com/Securepoint/nginx-ntlm-module/sessions/39515b70-c60c-4797-87d5-a048779c5e98
Co-authored-by: dsabotta <262123149+dsabotta@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix segmentation fault in nginx due to race condition" to "Fix segfaults caused by race condition between ntlm close handler and client conn cleanup" on May 5, 2026.
Copilot AI added a commit that referenced this pull request on May 7, 2026:
- Rewrites ngx_http_upstream_ntlm_module.c with four documented security invariants:
  - I1. Atomic item release via ngx_http_upstream_ntlm_item_release() helper
  - I2. c->read->data (not c->data) for cache item storage on idle connections
  - I3. OOM guard: cleanup-add failure aborts cache insertion (session-hijack fix)
  - I4. Stale-credential eviction on re-auth with established session
- Fixes from all open PRs (#3, #6, #7):
  - Synchronous cleanup handler (no posted events)
  - Session-hijack via cleanup-OOM (critical)
  - Stale-credential reuse on established connections (high)
  - client_connection NULL in eviction path
- Fault-injection macro NGX_NTLM_TEST_CLEANUP_NULL for testing
- Adds ntlm_time and ntlm_requests directives
- Adds notify peer callback pass-through
- Updates README.md for v2 with security section
- Updates Docker files to nginx 1.28.0 / alpine 3.20
- Adds TEST 5 (stale credential eviction) and TEST 6 (OOM guard, SKIP) to t/001-sanity.t

Agent-Logs-Url: https://github.com/Securepoint/nginx-ntlm-module/sessions/3343cf49-d4a7-4188-89b8-5a89fad4a7fe
Co-authored-by: matthias-lay <163420385+matthias-lay@users.noreply.github.com>
When a client connection closes, `ngx_http_upstream_client_conn_cleanup` posted the peer read event and then immediately moved the cache item to the free queue. When that posted event later fired, `ngx_http_upstream_ntlm_close_handler` operated on an item already in the free queue, removing and re-inserting it and causing queue corruption and segfaults.

Changes

- `ngx_http_upstream_client_conn_cleanup`: Replace the post-event approach with a direct, atomic close sequence: set `item->peer_connection = NULL` first (it acts as a sentinel), move the item to the free queue, then call `ngx_http_upstream_ntlm_close()` directly. No posted event, no window for the race.
- `ngx_http_upstream_ntlm_close_handler` (`close:` label): Add a guard on `item->peer_connection == NULL`; if the cleanup handler already processed this item, just close the connection and return without touching the queue.

Original prompt
Problem
When using this module with nginx 1.28.3, segfaults occur in `ngx_http_upstream_ntlm_close_handler` and `ngx_http_upstream_client_conn_cleanup` due to a race condition between these two functions manipulating the same cache item's queue membership and peer connection.

Root Cause
When a client connection closes, `ngx_http_upstream_client_conn_cleanup` (line 371):

- sets `item->peer_connection->read->timedout = 1`
- posts `item->peer_connection->read` to `ngx_posted_events`
- removes `item` from the cache queue and inserts it into the free queue (lines 385-386)

Later, when the posted event fires, `ngx_http_upstream_ntlm_close_handler` (line 394) runs and:

- reads `item = c->data`, the same item that is now already in the free queue
- sets `item->peer_connection = NULL`
- calls `ngx_queue_remove(&item->queue)`, which removes it from the free queue (corruption!)
- inserts it into `conf->free` again: double insertion / queue corruption

This causes three types of crashes:

- in `ngx_http_upstream_ntlm_close_handler`, accessing corrupted queue/item data
- in `ngx_http_upstream_handler`, because queue corruption causes a freed/reused connection to still be found in the cache
- in `ngx_http_upstream_client_conn_cleanup`, because `item->peer_connection->read` is already freed
Fix

In `ngx_http_upstream_client_conn_cleanup`: Instead of posting an event and then moving the queue item (which creates the race), close the peer connection directly and move the queue item to free in one atomic operation. Set `item->peer_connection = NULL` before closing so the close handler can detect it's already been handled.

In `ngx_http_upstream_ntlm_close_handler`: Add a guard at the `close:` label: if `item->peer_connection` is already NULL, the cleanup handler already processed this item, so just close the connection and return without touching the queue.

Patched `ngx_http_upstream_client_conn_cleanup`:
Patched `ngx_http_upstream_ntlm_close_handler` (`close:` label):
Note: `ngx_http_upstream_ntlm_close` is declared `static` below `ngx_http_upstream_client_conn_cleanup` in the file, so it may need to be forward-declared, or the function order adjusted, so that the cleanup function can call it.

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: We are using nginx (https://github.com/nginx/nginx/tree/release-1.28.3) with this module.
Some segfaults occur. Here are some backtraces:
core-nginx-11-65534-65534-614067-1777025800:
[New LWP 614067]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 ngx_http_upstream_ntlm_close_handler (ev=) at ../nginx-ntlm-module/ngx_http_upstream_ntlm_module.c:463
#0 ngx_http_upstream_ntlm_close_handler (ev=) at ../nginx-ntlm-module/ngx_http_upstream_ntlm_module.c:463
#1 0x0000561a3085cb58 in ngx_event_process_posted (cycle=cycle@entry=0x561a48238c60, posted=0x561a309494f0 <ngx_posted_events>) at src/event/ngx_event_posted.c:34
#2 0x0000561a3085c695 in ngx_process_events_and_timers (cycle=cycle@entry=0x561a48238c60) at src/event/ngx_event.c:263
#3 0x0000561a30865d55 in ngx_worker_process_cycle (cycle=cycle@entry=0x561a48238c60, data=data@entry=0x0) at src/os/unix/ngx_process_cycle.c:721
#4...
This pull request was created from Copilot chat.