
Tickets/dm 54206#1015

Merged
jgates108 merged 3 commits into tickets/DM-43715 from tickets/DM-54206
Apr 8, 2026

Conversation

@jgates108
Contributor

The WorkerCzarComIssue message from the worker to the czar contains every UberJob file-ready message that failed to be transmitted to the czar. If any one of the failed transmits in the WorkerCzarComIssue had since been completed or cancelled, the czar would tell the worker there was a parse error, and none of the file-ready messages would be removed from the worker's list. They would all be sent again the next time the worker sent the WorkerCzarComIssue message, which would compound the problem.

The fix is better parsing on the czar side and returning a list with more information about what the czar is doing with each UberJob, so the worker can make better decisions.

@jgates108 jgates108 marked this pull request as ready for review April 7, 2026 21:51
  lock_guard<mutex> lock(chunksMtx);
  bool fillInChunkIdTag = false;  // do not fill in the chunkId
- cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag);
+ cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag, checkDbs);
Contributor

@fritzm fritzm Apr 7, 2026

Is there a reason to pass in checkDBs here as an argument, rather than having an internal _dbsChecked member on the session, like one would do with a lazy constructor?

I think the "lazy" way would require fewer edits and be more robust against future code changes, since it guarantees the validation is called exactly once, before it is first needed?

Contributor Author

After pulling in a rebase from main, some tests on the cluster got painfully slow (from 4 min to 35 min) because the method for accessing CSS changed on main. This is meant to be a stopgap measure to allow testing with reasonable performance until it gets fixed properly. Here is the existing comment I put in buildChunkQuerySpec; validateDominantDbs() is called in numerous places for unknown reasons:
/// TODO: checkDominantDb may not be the best way to go about this, but checking it
/// for every chunk is not useful and wastes a lot of time as the information doesn't
/// change and all chunks use the same databases (while using different tables).
/// There are things that could be done, maybe all are applicable, or maybe not worth the effort:
/// - validateDominantDbs() is called in many places and maybe it just doesn't need to be called
/// that often.
/// - Create a cache of CSS that is immutable so many threads can read without locking.
/// - When a query is started, it asks for a shared pointer to the CSS cache, if the
/// CSS cache is obsolete, a new cache is made and returned, otherwise the existing
/// cache is returned. This way, the CSS cache is only updated when the CSS changes,
/// and all threads can read from it without locking. Changing CSS values mid-analysis
/// may be detrimental, and this would prevent that as well.
if (checkDominantDb && !validateDominantDbs()) {

CVTIntPtr _activeWorkerTimeoutAliveSecs = // 3min
util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutAliveSecs", notReq, 60 * 3);
CVTIntPtr _activeWorkerTimeoutDeadSecs = // 6min
util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutDeadSecs", notReq, 60 * 6);
Contributor

These changes seem unrelated to the others -- could we at least have them on a separate commit here? (Or better if moved to a different PR where they are more closely related?)

Contributor Author

Removed

@iagaponenko
Contributor

I have a general comment on this commit:

Reduced the number of database verification calls.

The implementation is excessive and rests on the incorrect assumption that the checks need to be controlled externally (via the checkDBs parameter). In reality, the "dominant" DB validation has to be performed just once (lazily) during the query analysis stage, because all databases are already named in the user query. Hence, the validation could be performed at one of the query entry points (QuerySession). If any changes to the dominant databases occurred during subsequent query dispatch, far more significant problems with query processing would be seen.

My second observation on this commit is that it has a bunch of other changes that are totally irrelevant to what the commit's title says (and what the commit is meant to address). Any chance these changes could be pushed into different commits to let the commit be consistent with expectations and its description?

This is a solution that I would expect to see in this commit:

class QuerySession {
private:
    mutable bool _domainValidated = false;  ///< True if the dominant databases have been validated
};

bool QuerySession::validateDominantDbs() const {
    // Do it once since it can be expensive, and the result should not change
    // during the query session.
    if (_domainValidated) return true;
    // ... perform the actual validation here; return false if it fails ...
    _domainValidated = true;  // cache only a successful validation
    return true;
}

My third comment on the commit: I thought we agreed that the "dominant" DB validation would be made in the main branch first (presumably similar to the code snippet I posted above) and would then propagate into the "dev" branch via a rebase. That way we would avoid code duplication and unnecessary conflicts during subsequent rebases, since the very same optimization is also needed in the production ("main") branch of Qserv.

@iagaponenko
Contributor

My other general comment for the PR is based on the following intent expressed in the description of the PR:

The fix is to have better parsing and return a list that has more information about what the czar is doing with each UberJob in the list so the worker can make better decisions.

The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars, including:

  • executing queries
  • cancelling query requests
  • returning result files of the completed jobs (uberjobs)
  • deleting result files

And if any changes in the Czar's query processing plan happen, then the corresponding corrective commands would be sent to the workers. This has always been one of the fundamental principles of the Qserv architecture. The ticket/PR description seems to indicate a significant deviation from this architecture.

Could this be explained, please?

@jgates108
Contributor Author

"The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars, including:"

The purpose of the changes is to bring the worker and czar back into sync after some degree of communication failures. The WorkerCzarComIssue message contains what would have been sent from the worker to the specific czar if there had not been a network failure. If the network failure was really long, the message says it cancelled all of the work the czar had given it (the "I thought you were dead" flag).

The complexity in WorkerCzarComIssue is keeping track of what has been successfully sent to the czar so that it is not sent repeatedly, and anything that does get sent repeatedly must be harmless.

UberJobs, as always, either succeed, fail, or the czar cancels them. None of that has changed. In some cases they may leave behind a complete and accurate result file for the UberJob, which will get picked up by garbage collection (this has to do with LIMIT queries).

@jgates108 jgates108 merged commit 78e886c into tickets/DM-43715 Apr 8, 2026
7 checks passed
@jgates108 jgates108 deleted the tickets/DM-54206 branch April 8, 2026 00:51
fritzm pushed a commit that referenced this pull request Apr 10, 2026
fritzm pushed a commit that referenced this pull request Apr 13, 2026