
Tickets/dm 54206#1015

Merged
jgates108 merged 3 commits into tickets/DM-43715 from tickets/DM-54206
Apr 8, 2026

Conversation

@jgates108
Contributor

The WorkerCzarComIssue message from the worker to the czar contains every UberJob file-ready message that failed to be transmitted to the czar. If any one of the failed transmits in the WorkerCzarComIssue had since been completed or cancelled, the czar would tell the worker there was a parse error, and none of the file-ready messages would be removed from the worker's list. They would all be sent again the next time the worker sent the WorkerCzarComIssue message, which would compound the problem.

The fix is better parsing on the czar side and returning a list with more information about what the czar is doing with each UberJob, so the worker can make better decisions.

@jgates108 jgates108 marked this pull request as ready for review April 7, 2026 21:51
  lock_guard<mutex> lock(chunksMtx);
  bool fillInChunkIdTag = false;  // do not fill in the chunkId
- cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag);
+ cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag, checkDbs);
Contributor

@fritzm fritzm Apr 7, 2026

Is there a reason to pass in checkDBs here as an argument, rather than having an internal _dbsChecked member on the session, like one would do with a lazy constructor?

I think the "lazy" way would require fewer edits and be more robust against future code changes, since it guarantees the validation is called exactly once, before it is first needed?

Contributor Author

After pulling in a rebase from main, some tests on the cluster got painfully slow (from 4 min to 35 min) because the method for accessing CSS changed on main. This is meant to be a stopgap measure to allow testing with reasonable performance until it gets fixed properly. Here is the existing comment I put in buildChunkQuerySpec; validateDominantDbs() is called in numerous places for unknown reasons:
/// TODO: checkDominantDb may not be the best way to go about this, but checking it
/// for every chunk is not useful and wastes a lot of time as the information doesn't
/// change and all chunks use the same databases (while using different tables).
/// There are things that could be done, maybe all are applicable, or maybe not worth the effort:
/// - validateDominantDbs() is called in many places and maybe it just doesn't need to be called
/// that often.
/// - Create a cache of CSS that is immutable so many threads can read without locking.
/// - When a query is started, it asks for a shared pointer to the CSS cache, if the
/// CSS cache is obsolete, a new cache is made and returned, otherwise the existing
/// cache is returned. This way, the CSS cache is only updated when the CSS changes,
/// and all threads can read from it without locking. Changing CSS values mid-analysis
/// may be detrimental, and this would prevent that as well.
if (checkDominantDb && !validateDominantDbs()) {

CVTIntPtr _activeWorkerTimeoutAliveSecs = // 3min
util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutAliveSecs", notReq, 60 * 3);
CVTIntPtr _activeWorkerTimeoutDeadSecs = // 6min
util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutDeadSecs", notReq, 60 * 6);
Contributor

These changes seem unrelated to the others -- could we at least have them on a separate commit here? (Or better if moved to a different PR where they are more closely related?)

Contributor Author

Removed

@iagaponenko
Contributor

I have a general comment on this commit:

Reduced the number of database verification calls.

The implementation is excessive and rests on the incorrect assumption that the checks need to be controlled externally (via the checkDBs parameter). In reality, the "dominant" DB validation has to be performed just once (lazily) during the query analysis stage, because all databases are already named in the user query. Hence, the validation could be performed at one of the query entry points (QuerySession). If any changes to the dominant databases occurred during subsequent query dispatch, far more significant problems with query processing would be seen.

My second observation on this commit is that it has a bunch of other changes that are totally irrelevant to what the commit's title says (and what the commit is meant to address). Any chance these changes could be pushed into different commits to let the commit be consistent with expectations and its description?

This is a solution that I would expect to see in this commit:

class QuerySession {
private:
    mutable bool _domainValidated = false;  ///< True if the dominant databases have been validated
};

bool QuerySession::validateDominantDbs() const {
    // Do it once since it can be expensive, and the result should not change
    // during the query session.
    if (_domainValidated) return true;
    // ... perform the actual validation here; return false if it fails ...
    _domainValidated = true;  // cache only a successful validation
    return true;
}

My third comment on the commit: I thought we agreed that the "dominant" DB validation would be made in the main branch first (presumably similar to the code snippet I posted above) and would then propagate into the "dev" branch via a rebase. That way we would avoid code duplication and unnecessary conflicts during subsequent rebases, since the very same optimization is also needed in the production ("main") branch of Qserv.

@iagaponenko
Contributor

My other general comment for the PR is based on the following intent expressed in the description of the PR:

The fix is to have better parsing and return a list that has more information about what the czar is doing with each UberJob in the list so the worker can make better decisions.

The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars, including:

  • executing queries
  • cancelling query requests
  • returning result files of the completed jobs (uberjobs)
  • deleting result files

And if any changes in the Czar's query processing plan happen, then the corresponding corrective commands would be sent to the workers. This has always been one of the fundamental principles of the Qserv architecture. The ticket/PR description seems to indicate a significant deviation from this architecture.

Could this be explained, please?

@jgates108
Contributor Author

"The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars, including:"

The purpose of the changes is to bring the worker and czar back into sync after some degree of communication failures. The WorkerCzarComIssue message contains what would have been sent from the worker to the specific czar if there had not been a network failure. If the network failure was really long, the message says it cancelled all of the work the czar had given it (the "I thought you were dead" flag).

The complexity in WorkerCzarComIssue is keeping track of what has been successfully sent to the czar so that it is not sent repeatedly, and anything that does get sent repeatedly must be harmless.

UberJobs, as always, either succeed, fail, or the czar cancels them. None of that has changed. In some cases they may leave behind a complete and accurate result file for the UberJob, which will get picked up by garbage collection (this has to do with LIMIT queries).

@jgates108 jgates108 merged commit 78e886c into tickets/DM-43715 Apr 8, 2026
7 checks passed
@jgates108 jgates108 deleted the tickets/DM-54206 branch April 8, 2026 00:51
fritzm pushed a commit that referenced this pull request Apr 10, 2026
fritzm pushed a commit that referenced this pull request Apr 13, 2026