Skip to content

order contributors to resolve by unique email for improved efficiency and github API fault tolerance #3780

@MoralCode

Description

@MoralCode

https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L210-L246

This query at the start of contributor resolution returns the commit contributors (name, email, commit hash).

There are a few, less impactful problems with this that cause it to make contributor resolution less efficient than it could be.

  1. we are querying from the commits table, which is a misnomer and actually has one row per file per commit (known issue commits table is actually representing commit file changes #3682 )
  2. the query is sorting by commit hash, meaning that, because of how hashes are generated in git, a contributor with multiple commits to resolve will have multiple rows in this file spread theoretically evenly throughout the results of this query

This means we have to perform extra iterations later and rely on our per-iteration checks to catch things like a contributor that got resolved on the first iteration, and now has several other records that need linking to the same already-resolved contributor ID.

What I would ideally like to see is a query that gives us one record back per unique contributor email address (which means we would need to have a LIST of commit hashes).

this gives us several key benefits:

  1. a single iteration of the contributor processing loop for each contributor (by unique email), meaning less time spent checking and re-checking the same email over and over
  2. the contributor resolution process will have access to multiple commit hashes to check with the github API, better guarding against one off API errors that put records in the unresolved table (i.e. enabling logic like "okay the first commit didnt give us the github username, lest try 2 or 3 more commits before falling back to the less reliable search API for resolution")

in other words, this dramatically reduces overlapping processing. For example, while the current process already links all commits matching the resolved email to the contributor, this new process would prevent a scenario where the first commit from a contributor is seen, the contributor is resolved and linked (including all their commits), yet that contributor is not removed from the iteration loop/still exists in the initial query (above) because they were so heavily duplicated. This leads to future iterations of contributor resolution running for records that already have their cmt_ght_author_id set to a value, straight up wasting time and processing cycles

ref: https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L160

This "list of commits" behavior can be achieved with a postgres string aggregate function, to comma-separate the hashes in the single existing column. 8Knot already uses this behavior in other parts of its code and it is something Cali is familiar with.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions