This repository was archived by the owner on Aug 27, 2019. It is now read-only.

do duplicate classifications break aggregations? #169

@vrooje

https://developer.zooniverse.org/projects/aggregation/en/latest/point_aggregation.html says the agglomerative clustering method decides when to stop combining clusters at least in part by refusing to merge 2 clusters that both contain markings by the same user, on the principle that (mostly) the same user won't mark the same feature twice.

That would be fine if our duplication rate were 0%, but it's not. Immediately after Panoptes was released the duplication rate was much higher than normal, and even now that those bugs have been squashed it's non-zero for reasons that we can't control. If a user gets the same subject twice and classifies it the same way, every single feature they have marked will be split into 2 clusters even if the agglomeration method works perfectly. I think this could explain some of the weird behavior in #144 and #165: the presence of duplicates could mean the agglomeration stops before it should, so even a single duplicate classification might lead to detecting >>2 clusters per actual feature.
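To make the failure mode concrete, here is a toy sketch of agglomeration with a "no two marks sharing a key in one cluster" rule. It is not the aggregation engine's actual code: `cluster_marks`, the centroid linkage, and the `max_dist` cutoff are all assumptions made for illustration. Three users mark one feature, and one of them (`alice`) is shown the subject a second time.

```python
from itertools import combinations
from math import dist


def cluster_marks(marks, key, max_dist=10.0):
    """Greedy agglomerative clustering of point marks (illustrative only).

    marks: list of dicts with "x", "y" and the field named by `key`.
    Two clusters are never merged if they share a value of `key`.
    """
    clusters = [[m] for m in marks]

    def centroid(cluster):
        n = len(cluster)
        return (sum(m["x"] for m in cluster) / n,
                sum(m["y"] for m in cluster) / n)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            keys_i = {m[key] for m in clusters[i]}
            keys_j = {m[key] for m in clusters[j]}
            if keys_i & keys_j:
                continue  # merging would put two marks with one key together
            d = dist(centroid(clusters[i]), centroid(clusters[j]))
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters  # no permitted merge is left
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


# One real feature near (100, 100); hypothetical data.
marks = [
    {"user_name": "alice", "x": 100, "y": 100},
    {"user_name": "bob", "x": 101, "y": 100},
    {"user_name": "carol", "x": 100, "y": 101},
    {"user_name": "alice", "x": 100, "y": 100},  # duplicate classification
]

without_duplicate = cluster_marks(marks[:3], key="user_name")
with_duplicate = cluster_marks(marks, key="user_name")
print(len(without_duplicate), len(with_duplicate))  # 1 2
```

In this toy case the two `alice` marks each seed their own cluster, and because neither cluster may absorb the other, the single feature is reported twice.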

Do the aggregations throw out duplicate classifications? If not, the merge rule should be that a cluster can't contain 2 marks from the same classification_id, rather than 2 marks from the same user_name or even the same user_name+created_at pair.
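For the first option, a minimal sketch of dropping duplicates before aggregating might look like the following. The helper name `drop_duplicate_classifications` is hypothetical, the `subject_id` field is an assumption about how classifications are keyed to subjects, and `created_at` is assumed to be an ISO 8601 string in a single timezone so that string sorting matches time order.

```python
def drop_duplicate_classifications(classifications):
    """Keep only each user's earliest classification of each subject.

    Assumes every record has "user_name", "subject_id" and "created_at";
    "subject_id" is an assumed field name, not confirmed by the issue.
    """
    seen = set()
    kept = []
    for c in sorted(classifications, key=lambda c: c["created_at"]):
        pair = (c["user_name"], c["subject_id"])
        if pair in seen:
            continue  # this user has already classified this subject
        seen.add(pair)
        kept.append(c)
    return kept


# Hypothetical records: alice is shown subject 10 twice.
classifications = [
    {"classification_id": 1, "user_name": "alice", "subject_id": 10,
     "created_at": "2015-07-01T12:00:00Z"},
    {"classification_id": 2, "user_name": "bob", "subject_id": 10,
     "created_at": "2015-07-01T12:05:00Z"},
    {"classification_id": 3, "user_name": "alice", "subject_id": 10,
     "created_at": "2015-07-02T09:00:00Z"},
    {"classification_id": 4, "user_name": "alice", "subject_id": 11,
     "created_at": "2015-07-02T09:01:00Z"},
]

kept = drop_duplicate_classifications(classifications)
print([c["classification_id"] for c in kept])  # [1, 2, 4]
```

Filtering this way discards the repeat classification entirely, whereas keying the merge rule on classification_id would keep both sets of marks and simply allow them to land in the same cluster.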
