Force-pushed from 1606c05 to 51eb567.
```diff
     lock_guard<mutex> lock(chunksMtx);
     bool fillInChunkIdTag = false;  // do not fill in the chunkId
-    cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag);
+    cs = _qSession->buildChunkQuerySpec(queryTemplates, chunkSpec, fillInChunkIdTag, checkDbs);
```
Is there a reason to pass `checkDbs` in as an argument, rather than keeping an internal `_dbsChecked` member on the session, as one would with lazy initialization?
The lazy approach would, I think, require fewer edits and be more robust against future code changes, since it guarantees the check runs exactly once, before it is first needed.
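A minimal sketch of the lazy approach being suggested (the class and member names here are illustrative, not the actual Qserv session class):

```cpp
#include <cassert>
#include <mutex>

// Hypothetical session that validates the dominant DBs at most once,
// the first time a chunk query spec is built.
class QuerySession {
public:
    // Called from buildChunkQuerySpec(); thread-safe and idempotent.
    bool ensureDominantDbsValidated() {
        std::call_once(_dbsCheckedFlag, [this] {
            _dbsValid = validateDominantDbs();  // expensive CSS lookup, done once
            ++_validationCalls;
        });
        return _dbsValid;
    }

    int validationCalls() const { return _validationCalls; }

private:
    bool validateDominantDbs() { return true; }  // stand-in for the real check

    std::once_flag _dbsCheckedFlag;
    bool _dbsValid = false;
    int _validationCalls = 0;
};
```

With something like this, `buildChunkQuerySpec` would not need a `checkDbs` argument at all; every caller gets exactly-once validation for free.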
After pulling in a rebase from main, some tests on the cluster got painfully slow (from 4 min to 35 min) because the method for accessing CSS changed on main. This is meant to be a stop-gap measure to allow testing with reasonable performance until it gets fixed properly. Here is the existing comment I put in buildChunkQuerySpec; validateDominantDbs() is called in numerous places for unknown reasons:
/// TODO: checkDominantDb may not be the best way to go about this, but checking it
/// for every chunk is not useful and wastes a lot of time as the information doesn't
/// change and all chunks use the same databases (while using different tables).
/// There are things that could be done, maybe all are applicable, or maybe not worth the effort:
/// - validateDominantDbs() is called in many places and maybe it just doesn't need to be called
/// that often.
/// - Create a cache of CSS that is immutable so many threads can read without locking.
/// - When a query is started, it asks for a shared pointer to the CSS cache, if the
/// CSS cache is obsolete, a new cache is made and returned, otherwise the existing
/// cache is returned. This way, the CSS cache is only updated when the CSS changes,
/// and all threads can read from it without locking. Changing CSS values mid analysis
// may be detrimental, and this would prevent that as well.
if (checkDominantDb && !validateDominantDbs()) {
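The immutable-cache idea from the comment above could be sketched roughly like this (all names here are hypothetical, not the real Qserv CSS API):

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical immutable snapshot of CSS data. Once built it is never
// modified, so any number of threads can read it without locking.
struct CssSnapshot {
    int version;             // bumped whenever the CSS changes
    std::string dominantDb;  // example of cached, read-only CSS state
};

class CssCache {
public:
    // Returns the current snapshot, rebuilding it only if the CSS version
    // has moved on. A query holds the returned shared_ptr for its whole
    // analysis, so CSS values cannot change mid-analysis underneath it.
    std::shared_ptr<const CssSnapshot> getSnapshot(int cssVersion) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (!_snapshot || _snapshot->version != cssVersion) {
            _snapshot = std::make_shared<const CssSnapshot>(
                    CssSnapshot{cssVersion, loadDominantDbFromCss()});
        }
        return _snapshot;
    }

private:
    std::string loadDominantDbFromCss() { return "exampleDb"; }  // stand-in

    std::mutex _mtx;
    std::shared_ptr<const CssSnapshot> _snapshot;
};
```

The only lock is the brief one taken while handing out (or swapping) the snapshot pointer; readers holding an old snapshot keep it alive via the shared_ptr even after a newer one replaces it.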
src/cconfig/CzarConfig.h
```cpp
    CVTIntPtr _activeWorkerTimeoutAliveSecs =  // 3min
            util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutAliveSecs", notReq, 60 * 3);
    CVTIntPtr _activeWorkerTimeoutDeadSecs =  // 6min
            util::ConfigValTInt::create(_configValMap, "activeworker", "timeoutDeadSecs", notReq, 60 * 6);
```
These changes seem unrelated to the others -- could we at least have them in a separate commit here? (Or better, move them to a different PR where they are more closely related?)
I have a general comment on this commit: the implementation is excessive and rests on an incorrect assumption that the checks need to be controlled externally (via the `checkDbs` argument). This is a solution that I would expect to see in this commit: …

My second observation on this commit is that it contains a bunch of other changes that are totally irrelevant to what the commit's title says (and to what the commit is meant to address). Any chance these changes could be pushed into different commits, to keep the commit consistent with expectations and its description?

My third comment for the commit is that I thought we had agreed that the "dominant" DB validation was made in the …
My other general comment for the PR is based on the following intent expressed in the description of the PR: …
The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars, including: …

And if any changes in the Czar's query-processing plan happen, then the corresponding corrective commands would be sent to the workers. This has always been one of the fundamental principles of the Qserv architecture. The ticket/PR description seems to indicate a significant deviation from this architecture. Could this be explained, please?
Force-pushed from c242f89 to e892473.
> "The concern is that the PR introduces a behavioral dependency of the worker on the context of operations being performed by the Czar. My impression has always been that Qserv workers are supposed to behave deterministically (modulo errors) by following exact instructions sent by Czars."

The purpose of the changes is to bring the worker and czar back into sync after some degree of communication failure. UberJobs, as always, either succeed, fail, or are cancelled by the czar; none of that has changed. In some cases they may leave behind a complete and accurate result file for the UberJob, which will get picked up by garbage collection (this has to do with LIMIT queries).
The `WorkerCzarComIssue` message from the worker to the czar contains all UberJob file-ready messages that failed to be transmitted to the czar. If one of the failed transmits in the `WorkerCzarComIssue` had already been completed or cancelled, it would cause the czar to tell the worker there was a parse error, and none of the file-ready messages would be removed from the worker's list. That would cause them all to be sent again the next time the worker sent the `WorkerCzarComIssue` message, which would compound the problem.

The fix is to have better parsing and to return a list with more information about what the czar is doing with each UberJob in the list, so the worker can make better decisions.
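One way to picture that fix (the enum, struct, and function here are illustrative sketches, not the actual protocol): instead of a single pass/fail parse result, the czar's reply carries a per-UberJob disposition, and the worker drops only the file-ready entries the czar actually accounted for.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical per-UberJob disposition returned by the czar.
enum class UjDisposition { Acknowledged, Cancelled, Unknown };

// Worker-side pruning: `pending` holds UberJob ids whose "file ready"
// message has not yet been confirmed by the czar. Entries the czar
// acknowledged or cancelled are safe to drop; anything the czar could
// not account for stays queued for the next WorkerCzarComIssue message.
std::vector<int> pruneFileReadyList(
        std::vector<int> const& pending,
        std::unordered_map<int, UjDisposition> const& czarReply) {
    std::vector<int> stillPending;
    for (int ujId : pending) {
        auto it = czarReply.find(ujId);
        if (it == czarReply.end() || it->second == UjDisposition::Unknown) {
            stillPending.push_back(ujId);  // czar gave no decision: retry later
        }
    }
    return stillPending;
}
```

Because each entry is resolved individually, a single malformed or stale entry no longer forces the entire list to be resent, which is what compounded the problem before.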