Skip to content

fix memory leak: free g_tid in Driver::runTest#32

Open
sahitya-chandra wants to merge 8 commits into
osdldbt:mainfrom
sahitya-chandra:fix/driver-g_tid-memory-leak
Open

fix memory leak: free g_tid in Driver::runTest#32
sahitya-chandra wants to merge 8 commits into
osdldbt:mainfrom
sahitya-chandra:fix/driver-g_tid-memory-leak

Conversation

@sahitya-chandra

Copy link
Copy Markdown
Contributor

Summary

Fixes a memory leak in src/Driver/Driver.cpp where g_tid was allocated with malloc but never freed.

Changes

  • Free g_tid after all worker threads are joined at the end of runTest().
  • Free g_tid before throwing when pthread_join fails so the allocation is not leaked on the error path.
  • Set g_tid = NULL after freeing to avoid use-after-free.
  • Add #include <cstdlib> for free().

Comment thread src/Driver/Driver.cpp Outdated
// 0 represents the Data-Maintenance thread
for (int i = 0; i <= iUsers; i++) {
if (pthread_join(g_tid[i], NULL) != 0) {
free(g_tid);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is destroying all the handles to all the threads if there is an issues joining one of them right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sir, free(g_tid) frees the whole array, so we lose every pthread_t in one go, including threads we haven’t joined yet.

That’s intentional on this path: pthread_join failing is treated as fatal and we throw immediately afterward, so the normal driver run doesn’t continue. We still free(g_tid) so we don’t leak if the exception is ever caught or for static analysis.

I’ve added a short comment above that free in the error branch documenting this tradeoff.

@markwkm markwkm Mar 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little heavy handed if threads are still running, or rather maybe what we should really do based should be based on the situation. Needs some more thought...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what went wrong with my current approach is that if we free(g_tid) as soon as one pthread_join fails, we throw away every handle in one shot. Other threads can still be running or at least still need a join, so that’s heavy-handed (as pointed out by you also): we’ve lost the ability to join them even though we might still want to try.

@markwkm markwkm marked this pull request as ready for review March 10, 2026 23:51
@markwkm markwkm marked this pull request as draft March 10, 2026 23:51
@markwkm markwkm marked this pull request as ready for review March 10, 2026 23:52
@markwkm markwkm marked this pull request as draft March 10, 2026 23:52
@sahitya-chandra sahitya-chandra requested a review from markwkm March 23, 2026 19:02
@sahitya-chandra sahitya-chandra marked this pull request as ready for review March 23, 2026 19:18
@markwkm

markwkm commented Mar 24, 2026

Copy link
Copy Markdown
Contributor

@sahitya-chandra Since you are vying for a Google Summer of Code mentorship, I'm going to respond in an unconventional way. What I was looking for is discussion that shows understanding of what the program is meant to be doing and what it should be doing instead. Your code pushes after each of my comments indicates you think there is a binary response to my questions.

@sahitya-chandra

sahitya-chandra commented Mar 24, 2026

Copy link
Copy Markdown
Contributor Author

You’re right @markwkm sir, I treated your notes as “fix the code” instead of stepping back and explaining what it does, first

What this code is doing: runTest starts the data-maintenance thread plus iUsers customer threads and stores their pthread_t values in g_tid. The join loop is meant to block until all of those threads have finished, so we don’t return from the test while work is still running.

@sahitya-chandra

Copy link
Copy Markdown
Contributor Author

I think what went wrong with my current approach is that if we free(g_tid) as soon as one pthread_join fails, we throw away every handle in one shot. Other threads can still be running or at least still need a join, so that’s heavy-handed (as pointed out by you also): we’ve lost the ability to join them even though we might still want to try.

I think the better approach would be to attempt pthread_join on every slot (0 … iUsers) even if an earlier join fails, then free(g_tid) once, and only then throw if any join failed. That keeps handles valid until we’ve tried every join, avoids leaking g_tid, and reports failure after we’ve done what we can.

@markwkm

markwkm commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

I think the better approach would be to attempt pthread_join on every slot (0 … iUsers) even if an earlier join fails, then free(g_tid) once, and only then throw if any join failed. That keeps handles valid until we’ve tried every join, avoids leaking g_tid, and reports failure after we’ve done what we can.

I think that sounds better. Now if you're going to go through that effort, it may be worth identifying further the type of errors that maybe thrown and what is worth doing about them.

@sahitya-chandra

Copy link
Copy Markdown
Contributor Author

@markwkm pthread_join can fail with three errors, and each one means something different in this program's context:

ESRCH - no thread with that ID exists. This is the most realistic failure here. It could happen if a pthread_t slot was never properly written by pthread_create (though right now both entryCustomerWorkerThread and entryDMWorkerThread call exit(1) on creation failure, so if we reach the join loop all slots should be valid). It could also mean the thread already terminated and was reaped. Either way the thread is gone, there's nothing to wait for, so this is more of a warning than a fatal condition.

EINVAL - the thread is not joinable (detached), or another thread is already trying to join it. This shouldn't happen because pthread_attr_init defaults to joinable and nothing in the code detaches these threads. If it does occur, it points to a programming bug.

EDEADLK - deadlock detected (a thread trying to join itself, or a circular join). Can't happen here since the main thread is joining worker threads that never join back. I think if it did happen, it's a serious logic error

So instead of treating every non-zero return the same way, I think the join loop should capture the actual return code per-thread and log which thread failed and with what error. ESRCH could be treated as a warning (the thread is already gone, nothing to clean up), while EINVAL or EDEADLK would indicate a real bug worth throwing for. After the loop finishes and g_tid is freed, we throw only if a genuinely unexpected error was seen.

If this approach looks reasonable, I'll update the branch with the implementation.

@sahitya-chandra sahitya-chandra requested a review from markwkm April 1, 2026 18:57
@markwkm

markwkm commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

So instead of treating every non-zero return the same way, I think the join loop should capture the actual return code per-thread and log which thread failed and with what error. ESRCH could be treated as a warning (the thread is already gone, nothing to clean up), while EINVAL or EDEADLK would indicate a real bug worth throwing for. After the loop finishes and g_tid is freed, we throw only if a genuinely unexpected error was seen.

If this approach looks reasonable, I'll update the branch with the implementation.

I agree, this sound reasonable.

@sahitya-chandra

Copy link
Copy Markdown
Contributor Author

Thanks sir, I will go ahead with the implementation now

Handle thread join failure by freeing thread IDs and throwing an error.
Instead of treating all pthread_join failures the same, capture the
return code per-thread and respond based on the error:

- ESRCH (no such thread): log a warning and continue, since the thread
  is already gone and there is nothing left to wait for.
- EINVAL/EDEADLK or any other error: log the error and flag as fatal,
  since these indicate a programming bug.

After attempting to join every thread, free g_tid and only then throw
if a fatal error was seen.  This keeps all handles valid until every
join has been tried.
@sahitya-chandra sahitya-chandra force-pushed the fix/driver-g_tid-memory-leak branch from 4ced9a9 to 033bc38 Compare April 3, 2026 18:10
Comment thread src/Driver/Driver.cpp Outdated
continue;

ostringstream osErr;
if (rc == ESRCH) {

@markwkm markwkm Apr 7, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think readability is more important here, so don't use multiple if statements. Additional comments that reiterate your rationale for doing it this way would be good too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback sir. I think a switch statement would read much better here than chained if/else if blocks , one case per error code makes it easier to scan at a glance. I'll also add comments inside each case to explain why that error matters in this context, so the rationale is clear to anyone reading the code later

I'll push that update now

Use a switch statement instead of if/else chain for readability.
Each case handles a distinct pthread_join error code:
- ESRCH: thread already gone, non-fatal warning
- EINVAL: thread not joinable, indicates a programming bug
- EDEADLK: circular join detected, serious logic error
g_tid is freed after all joins are attempted, then throws only if a
fatal error was seen.
@sahitya-chandra sahitya-chandra requested a review from markwkm April 8, 2026 14:27
Comment thread src/Driver/Driver.cpp Outdated
throw new CThreadErr(
CThreadErr::ERR_THREAD_JOIN, "Driver::RunTest");
int rc = pthread_join(g_tid[i], NULL);
if (rc == 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be part of the switch. Any reason not to?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had kept rc == 0 outside the switch as a simple fast path since there’s nothing to handle in the success case, but I agree folding it into the switch is cleaner overall.

I’ll update it.

@sahitya-chandra sahitya-chandra requested a review from markwkm April 9, 2026 06:05
Comment thread src/Driver/Driver.cpp
ostringstream osErr;
switch (rc) {
case 0:
continue;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment here, would be good too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure sir. I'll add a comment there. I think something like Thread joined successfully, nothing to report keeps it consistent with the other cases

Comment thread src/Driver/Driver.cpp
free(g_tid);
g_tid = NULL;

if (fatal_error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting so I don't forget later. If we make it this far, I'm wondering if throwing an error is necessary if we're past the expected test duration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test duration has already passed, I assume the useful work is done and the threads should be exiting on their own. Throwing at that point feels like it could mask an otherwise clean run. I think we could check against stop_time and just log the error instead of throwing if we're past the expected duration. Happy to tackle it in this PR if you'd prefer, otherwise I can do it as a follow-up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea to check against stop_time. Log a warning or error as appropriate depending on the condition and explain the situation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks sir. I've pushed the update in commit 0c602a6. Now if pthread_join reports an error after stop_time has already passed, the loop logs a warning and returns normally instead of throwing, since the useful work of the run is already done and the threads are expected to be winding down. If we're still within the scheduled duration, it throws as before so the caller knows the run aborted early. I've added a comment above the check explaining the rationale.

If the scheduled test duration has already elapsed, the useful work of
the run is complete and worker threads are expected to be winding down.
A pthread_join failure at that point is likely just noise from a thread
that has already exited, so propagating it as an error would mask an
otherwise clean run. Log a warning and return normally in that case;
if we are still within the scheduled duration, throw as before so the
caller knows the run aborted early.
@sahitya-chandra sahitya-chandra requested a review from markwkm April 20, 2026 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants