`clean_crowns` over-cleans due to transitive union-find clustering

Dear developer,

Thank you for the updates of the `clean_crowns` function and the `post_clean` function. I was excited to test it but unfortunately, I ran into a significant over-cleaning issue I'd like to flag.

<img width="737" height="505" alt="Image" src="https://github.com/user-attachments/assets/e1071191-213a-4a44-897a-75611b997c21" />

The screenshot shows the result on one of my datasets. Yellow polygons are the original uncleaned crowns, red are the crowns after `clean_crowns`, and blue are crowns added back by `post_clean`. As you see, there are still huge gaps even after `post_clean` converges. 

**Root cause**

I believe this is due to the "transitivity problem" of the union-find cluster strategy in `clean_crowns`. Consider three crowns A, B, C arranged in a line:
* A overlaps B, B overlaps C, while A and C does not overlap. 
* A and B are merged, B and C are merged. A, B, C will be in one cluster.
* Say A has the highest confidence score, C will be removed even though it does not overlap A at all.

**Proposed fix** 

Rather than keeping one crown per cluster, I'd suggest a greedy NMS (non-maximum suppression) approach that resolves conflicts locally and directly. Crowns are sorted by confidence score descending; a crown is kept if it doesn't directly conflict with any already-kept crown, and only its direct conflicts are suppressed.

```
def clean_crowns():
    # 1. Filter
    ...
    # 2. Spatial join for candidate pairs
    ...
    # 3. Build a conflict graph: edges between crowns that truly conflict:
    # (high IoU OR containment).
    from collections import defaultdict
    conflicts = defaultdict(set)  # crown_idx -> set of conflicting crown_idxs

    for _, row in tqdm(
        join.iterrows(),
        ...
    ):
        i = row.name
        j = row["index_right"]
        if i >= j:
            continue  # Process each pair once

        geom_i = crowns.at[i, "geometry"]
        geom_j = crowns.at[j, "geometry"]
        intersection_area = geom_i.intersection(geom_j).area

        # IoU check
        iou_val = calc_iou(geom_i, geom_j)
        is_conflict = iou_val > iou_threshold

        # Containment check
        if not is_conflict and containment_threshold is not None:
            min_area = min(geom_i.area, geom_j.area)
            if min_area > 0 and (intersection_area / min_area) > containment_threshold:
                is_conflict = True

        if is_conflict:
            conflicts[i].add(j)
            conflicts[j].add(i)

    # 4. Greedy NMS: sort by confidence descending, 
    #  keep a crown if it doesn't conflict with any already-kept crown. 
    sorted_indices = crowns[field].sort_values(ascending=False).index.tolist()
    kept = set()
    removed = set()

    for i in sorted_indices:
        if i in removed:
            continue
        kept.add(i)
        # Only remove direct conflicts, not transitive ones
        for j in conflicts[i]:
            removed.add(j)

    cleaned_crowns = crowns.loc[sorted(kept)].copy()
    return gpd.GeoDataFrame(cleaned_crowns, crs=crowns.crs).reset_index(drop=True)
```

The result is shown below. I think this gives a much better balance between retaining high-confidence predictions and maintaining coverage of the tree canopy. If doing it right, it isn't even necessary to run `post_clean`, or only one interation of `post_clean` is enough.

<img width="761" height="548" alt="Image" src="https://github.com/user-attachments/assets/e465643d-b830-492f-b7c5-8e033978f4ca" />

In https://github.com/PatBall1/detectree2/issues/192, I also explored a slightly different strategy for the containment case: comparing the _average confidence score_ of the smaller (contained) crown against the larger (container) crown before deciding which to keep. This produces slightly different results from the greedy NMS approach above, while a single-shot NMS is easier and more straightforward. I'd love to hear your thoughts on this.

Thank you for you attention to this issue.




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`clean_crowns` over-cleans due to transitive union-find clustering #226

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

clean_crowns over-cleans due to transitive union-find clustering #226

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

`clean_crowns` over-cleans due to transitive union-find clustering #226