This repository was archived by the owner on Aug 27, 2019. It is now read-only.
https://developer.zooniverse.org/projects/aggregation/en/latest/point_aggregation.html says the agglomerative clustering method decides when to stop combining clusters, at least in part, by refusing to merge two clusters that both contain markings from the same user, on the principle that a user will (mostly) not mark the same feature twice.
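To make the failure mode concrete, here's a minimal sketch of that stopping rule (this is illustrative pseudocode in spirit, not the aggregation code's actual implementation): single-linkage agglomeration over 1-D marks tagged with a user name, where a merge is refused whenever the two clusters share a user. A duplicated classification from one user then forces that user's feature to stay split in two clusters forever.

```python
# Illustrative sketch of agglomerative clustering with a "no two marks from
# the same user in one cluster" merge constraint. Not the real aggregation code.
import itertools


def agglomerate(marks, max_dist):
    """marks: list of (x, user) tuples. Returns a list of clusters,
    each cluster a list of (x, user) marks."""
    clusters = [[m] for m in marks]
    while True:
        best = None
        for i, j in itertools.combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            # Stopping rule: never merge clusters that share a user.
            if {u for _, u in a} & {u for _, u in b}:
                continue
            # Single-linkage distance between the two clusters.
            d = min(abs(x1 - x2) for x1, _ in a for x2, _ in b)
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
```

With marks `[(0.0, 'A'), (0.1, 'B')]` this yields one cluster, but adding a duplicate mark from user A at the same feature, `[(0.0, 'A'), (0.1, 'B'), (0.05, 'A')]`, yields two clusters even though all three marks sit on one feature: the two A-marks can never end up in the same cluster, so the merge stops early.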
That would be fine if our duplication rate were 0%, but it's not. Immediately after Panoptes was released the duplication rate was much higher than normal, and even now that those bugs have been squashed it's non-zero for reasons we can't control. If a user gets the same subject twice and classifies it the same way, every single feature they marked will be split into 2 clusters even if the agglomeration method works perfectly. I think this could explain some of the weird behavior in #144 and #165: the presence of duplicates could make the agglomeration stop before it should, so even a single duplication might lead to detecting >>2 clusters per actual feature.
Do the aggregations throw out duplicate classifications? If not, we should constrain the agglomeration so that no cluster contains 2 marks from the same classification_id, rather than from the same user_name or even the same user_name+created_at pair.
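The other option raised above, throwing out duplicate classifications before clustering runs at all, could look something like this sketch. The dict field names (`classification_id`, `user_name`, `subject_id`) are assumptions about the export format, used here purely for illustration:

```python
# Hedged sketch: keep only the first classification per (user_name, subject_id)
# pair, so a user who was served the same subject twice contributes marks from
# only one classification. Field names are illustrative, not the real schema.

def dedupe(classifications):
    """classifications: list of dicts, each one classification.
    Returns the list with duplicate (user_name, subject_id) entries dropped,
    keeping the earliest-seen one."""
    seen = set()
    kept = []
    for c in classifications:
        key = (c["user_name"], c["subject_id"])
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept
```

Either approach (filtering up front, or keying the merge constraint on classification_id instead of user_name) avoids the forced split described above; filtering up front has the advantage of also fixing any downstream statistics that count classifications.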