The query at the start of contributor resolution sorts by commit hash. Because of how hashes are generated in git, a contributor with multiple commits to resolve will have multiple rows spread, in theory, roughly evenly throughout the results of this query.
This means we have to perform extra iterations later and rely on our per-iteration checks to catch cases such as a contributor who was resolved on the first iteration but still has several other records that need linking to the same, already-resolved contributor ID.
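To illustrate the scattering effect, here is a small sketch using made-up data. The emails and the `fake_commit_hash` helper are hypothetical. SHA-1 over arbitrary strings stands in for real git commit hashes, and sorting the list stands in for the query's ordering by hash.

```python
import hashlib


def fake_commit_hash(email: str, n: int) -> str:
    # Stand-in for a git commit hash: SHA-1 over arbitrary "commit content".
    return hashlib.sha1(f"{email}:{n}".encode()).hexdigest()


# Hypothetical data: two contributors with 20 commits each.
rows = [
    (fake_commit_hash(email, n), email)
    for email in ("alice@example.com", "bob@example.com")
    for n in range(20)
]

# Equivalent of ordering the query results by commit hash.
rows.sort(key=lambda row: row[0])

# Count runs of consecutive rows that share the same email.
emails_in_order = [email for _, email in rows]
runs = 1 + sum(
    1 for prev, cur in zip(emails_in_order, emails_in_order[1:]) if prev != cur
)
print(f"{len(set(emails_in_order))} contributors, {runs} separate runs of rows")
```

If each contributor's rows were contiguous there would be exactly two runs. With hash ordering the rows interleave, so the same email keeps reappearing as the loop proceeds.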
What I would ideally like to see is a query that gives us one record back per unique contributor email address (which means we would need to have a LIST of commit hashes).
This gives us several key benefits:
- A single iteration of the contributor processing loop per contributor (by unique email), so less time is spent checking and re-checking the same email over and over.
- The contributor resolution process has access to multiple commit hashes to check against the GitHub API, which better guards against one-off API errors that put records in the unresolved table (i.e. it enables logic like "okay, the first commit didn't give us the GitHub username, let's try 2 or 3 more commits before falling back to the less reliable search API for resolution").
In other words, this dramatically reduces overlapping processing. For example, while the current process already links all commits matching the resolved email to the contributor, the new process would prevent a scenario where the first commit from a contributor is seen and the contributor is resolved and linked (including all their commits), yet that contributor is not removed from the iteration loop and still exists in the initial query results because they were so heavily duplicated. That leads to later iterations of contributor resolution running for records that already have `cmt_ght_author_id` set to a value, which wastes time and processing cycles.
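The "try 2 or 3 more commits before falling back" idea from the benefits list could look roughly like the sketch below. Everything here is hypothetical: `resolve_login`, `lookup_commit`, `search_by_email`, and `max_commit_attempts` are illustrative names, not Augur functions, and the two lookups are passed in as plain callables so that no real GitHub API call is assumed.

```python
from typing import Callable, Iterable, Optional


def resolve_login(
    email: str,
    commit_hashes: Iterable[str],
    lookup_commit: Callable[[str], Optional[str]],
    search_by_email: Callable[[str], Optional[str]],
    max_commit_attempts: int = 3,
) -> Optional[str]:
    """Try a few commits before falling back to the email search lookup."""
    for attempt, commit_hash in enumerate(commit_hashes):
        if attempt >= max_commit_attempts:
            break
        try:
            login = lookup_commit(commit_hash)
        except Exception:
            # Treat a one-off API failure as "no answer" and try the next commit.
            continue
        if login:
            return login
    # No commit produced a login: fall back to the less reliable search.
    return search_by_email(email)


# Usage with fake lookups: the first commit errors, the second one resolves.
def flaky_lookup(commit_hash: str) -> Optional[str]:
    if commit_hash == "aaa111":
        raise RuntimeError("simulated one-off API error")
    return "octocat" if commit_hash == "bbb222" else None


search_calls = []


def fake_search(email: str) -> Optional[str]:
    search_calls.append(email)
    return None


login = resolve_login(
    "dev@example.com", ["aaa111", "bbb222", "ccc333"], flaky_lookup, fake_search
)
print(login, search_calls)
```

In this run the search fallback is never called, because the second commit already yields a login. With only one commit hash per row, the simulated error would have sent the record straight to the fallback.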
This "list of commits" behavior can be achieved with a PostgreSQL string aggregate function (such as `string_agg`), which comma-separates the hashes in the single existing column. 8Knot already uses this behavior in other parts of its code, and it is something Cali is familiar with.
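As a rough sketch of the shape such a query could take, the example below groups rows by email and aggregates the hashes into one comma-separated string. It runs against an in-memory SQLite database so it is self-contained, and uses SQLite's `group_concat(x, ',')` where PostgreSQL would use `string_agg(x, ',')`. Apart from `cmt_ght_author_id`, the table name, column names, and data are assumptions for illustration and may not match the real schema or the existing query's filters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema for illustration only.
conn.execute(
    "CREATE TABLE commits ("
    " cmt_commit_hash TEXT, cmt_author_name TEXT,"
    " cmt_author_email TEXT, cmt_ght_author_id TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("c3", "Alice", "alice@example.com", None),
        ("a1", "Alice", "alice@example.com", None),
        ("b2", "Bob", "bob@example.com", None),
        ("d4", "Carol", "carol@example.com", "already-resolved"),
    ],
)

# One row per unresolved email, with all of its commit hashes in one column.
# In PostgreSQL the aggregate would be string_agg(cmt_commit_hash, ',').
grouped = conn.execute(
    """
    SELECT cmt_author_email,
           MIN(cmt_author_name) AS cmt_author_name,
           group_concat(cmt_commit_hash, ',') AS commit_hashes
    FROM commits
    WHERE cmt_ght_author_id IS NULL
    GROUP BY cmt_author_email
    ORDER BY cmt_author_email
    """
).fetchall()

for email, name, commit_hashes in grouped:
    # Split the aggregated column back into a list for the resolution loop.
    print(email, name, commit_hashes.split(","))
```

The resolution loop would then iterate once per email and split the aggregated column to get the candidate hashes. The order of hashes within the aggregated string is not guaranteed unless the aggregate is given an explicit ordering.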
https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L210-L246
This query at the start of contributor resolution returns the commit contributors (name, email, commit hash).
There are a few less impactful problems with this query that make contributor resolution less efficient than it could be.
ref (for the `cmt_ght_author_id` point above): https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L160