Skip to content

clean_crowns over-cleans due to transitive union-find clustering #226

@CiSong10

Description

@CiSong10

Dear developer,

Thank you for the updates of the clean_crowns function and the post_clean function. I was excited to test it but unfortunately, I ran into a significant over-cleaning issue I'd like to flag.

Image

The screenshot shows the result on one of my datasets. Yellow polygons are the original uncleaned crowns, red are the crowns after clean_crowns, and blue are crowns added back by post_clean. As you see, there are still huge gaps even after post_clean converges.

Root cause

I believe this is due to the "transitivity problem" of the union-find cluster strategy in clean_crowns. Consider three crowns A, B, C arranged in a line:

  • A overlaps B, B overlaps C, while A and C does not overlap.
  • A and B are merged, B and C are merged. A, B, C will be in one cluster.
  • Say A has the highest confidence score, C will be removed even though it does not overlap A at all.

Proposed fix

Rather than keeping one crown per cluster, I'd suggest a greedy NMS (non-maximum suppression) approach that resolves conflicts locally and directly. Crowns are sorted by confidence score descending; a crown is kept if it doesn't directly conflict with any already-kept crown, and only its direct conflicts are suppressed.

def clean_crowns():
    # 1. Filter
    ...
    # 2. Spatial join for candidate pairs
    ...
    # 3. Build a conflict graph: edges between crowns that truly conflict:
    # (high IoU OR containment).
    from collections import defaultdict
    conflicts = defaultdict(set)  # crown_idx -> set of conflicting crown_idxs

    for _, row in tqdm(
        join.iterrows(),
        ...
    ):
        i = row.name
        j = row["index_right"]
        if i >= j:
            continue  # Process each pair once

        geom_i = crowns.at[i, "geometry"]
        geom_j = crowns.at[j, "geometry"]
        intersection_area = geom_i.intersection(geom_j).area

        # IoU check
        iou_val = calc_iou(geom_i, geom_j)
        is_conflict = iou_val > iou_threshold

        # Containment check
        if not is_conflict and containment_threshold is not None:
            min_area = min(geom_i.area, geom_j.area)
            if min_area > 0 and (intersection_area / min_area) > containment_threshold:
                is_conflict = True

        if is_conflict:
            conflicts[i].add(j)
            conflicts[j].add(i)

    # 4. Greedy NMS: sort by confidence descending, 
    #  keep a crown if it doesn't conflict with any already-kept crown. 
    sorted_indices = crowns[field].sort_values(ascending=False).index.tolist()
    kept = set()
    removed = set()

    for i in sorted_indices:
        if i in removed:
            continue
        kept.add(i)
        # Only remove direct conflicts, not transitive ones
        for j in conflicts[i]:
            removed.add(j)

    cleaned_crowns = crowns.loc[sorted(kept)].copy()
    return gpd.GeoDataFrame(cleaned_crowns, crs=crowns.crs).reset_index(drop=True)

The result is shown below. I think this gives a much better balance between retaining high-confidence predictions and maintaining coverage of the tree canopy. If doing it right, it isn't even necessary to run post_clean, or only one interation of post_clean is enough.

Image

In #192, I also explored a slightly different strategy for the containment case: comparing the average confidence score of the smaller (contained) crown against the larger (container) crown before deciding which to keep. This produces slightly different results from the greedy NMS approach above, while a single-shot NMS is easier and more straightforward. I'd love to hear your thoughts on this.

Thank you for you attention to this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions