Pq eval by rhoadesScholar · Pull Request #200 · janelia-cellmap/cellmap-segmentation-challenge

rhoadesScholar · 2026-03-24T18:09:30Z

No description provided.

…ations
- Consolidated parallel scoring execution into a single process pool for improved performance.
- Updated scoring metrics to include panoptic quality (PQ), segmentation quality (SQ), and recognition quality (RQ).
- Introduced new data structures for PQ accumulators and updated related tests.
- Refactored tests to align with new scoring metrics and ensure comprehensive coverage.
- Removed outdated test file for refactored evaluate functions.

Copilot

Pull request overview

This PR updates the evaluation pipeline to use Panoptic Quality (PQ)–style accumulators (tp/fp/fn/sum_iou) for both instance (“thing”) and semantic (“stuff”) scoring, and changes aggregation to compute macro-averaged PQ across crops per label.
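The aggregation this overview describes can be sketched roughly as follows: per-crop PQ values are averaged within each label, and the overall score is an unweighted mean of the per-label values. The function name `macro_average_pq` and the input layout (a mapping from `(crop, label)` pairs to a PQ value) are assumptions made for illustration; they are not the repository's `combine_scores` interface.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import fmean


def macro_average_pq(
    per_crop_pq: dict[tuple[str, str], float],
) -> tuple[dict[str, float], float]:
    """Average per-crop PQ within each label, then average across labels.

    Hypothetical helper: keys are (crop_name, label_name) pairs.
    """
    by_label: dict[str, list[float]] = defaultdict(list)
    for (_crop, label), pq in per_crop_pq.items():
        by_label[label].append(pq)

    # Macro average: every crop counts equally within a label.
    per_label = {label: fmean(values) for label, values in by_label.items()}
    # Overall score: every label counts equally, however many crops it has.
    overall = fmean(per_label.values()) if per_label else 0.0
    return per_label, overall


scores = {("crop1", "mito"): 0.8, ("crop2", "mito"): 0.6, ("crop1", "nuc"): 0.5}
per_label, overall = macro_average_pq(scores)
```

With this layout, a label that appears in many crops does not outweigh a label that appears in one, which is the practical difference from summing accumulators globally.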

Changes:

- Replace instance/semantic metrics with PQ accumulator outputs and compute per-crop derived PQ/SQ/F1 fields.
- Update aggregation to produce per-label pq/sq/rq and overall scores as unweighted means across labels (with legacy aliases preserved).
- Simplify parallel scoring to a single ProcessPoolExecutor controlled by MAX_WORKERS, and update tests to assert PQ behavior.
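A minimal sketch of the single-pool pattern from the last item, assuming the worker count comes from a `MAX_WORKERS` environment variable with a fallback to older settings. The legacy variable names used here, the helper names `resolve_max_workers` and `score_all`, and the CPU-count default are assumptions for illustration; the PR summary does not spell them out.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Iterable, Mapping
from concurrent.futures import ProcessPoolExecutor

# Assumption: illustrative legacy names; the real fallback variables may differ.
LEGACY_VARS = ("MAX_INSTANCE_THREADS", "MAX_SEMANTIC_THREADS")


def resolve_max_workers(env: Mapping[str, str] | None = None) -> int:
    """Return the first valid positive integer among MAX_WORKERS and legacy vars."""
    env = os.environ if env is None else env
    for name in ("MAX_WORKERS", *LEGACY_VARS):
        raw = env.get(name, "").strip()
        if raw.isdigit() and int(raw) > 0:
            return int(raw)
    # Nothing usable was set: fall back to the machine's CPU count.
    return os.cpu_count() or 1


def score_all(score_fn: Callable, tasks: Iterable) -> list:
    """Run every scoring task through one shared process pool."""
    with ProcessPoolExecutor(max_workers=resolve_max_workers()) as pool:
        return list(pool.map(score_fn, tasks))
```

`score_fn` has to be a picklable, module-level function for the process pool to dispatch it, and on platforms that spawn workers the call to `score_all` belongs under an `if __name__ == "__main__":` guard.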

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_evaluate_refactored.py | Removes a large refactor-focused test suite (coverage impact). |
| tests/test_evaluate_metrics.py | Updates tests to validate PQ accumulator semantics and new aggregation outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/types.py | Adds `PQCropDict` TypedDict for PQ crop-level outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/submission.py | Uses a unified process pool; updates score_submission results documentation. |
| src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py | Reworks `score_instance`/`score_semantic` to return PQ accumulators and per-crop derived PQ/SQ/F1. |
| src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py | Adds greedy IoU-threshold PQ matching (`match_instances_pq`). |
| src/cellmap_segmentation_challenge/utils/eval_utils/distance.py | Reads Hausdorff threading from legacy env var directly (`PER_INSTANCE_THREADS`). |
| src/cellmap_segmentation_challenge/utils/eval_utils/config.py | Introduces `max_workers` with env precedence and legacy fallback. |
| src/cellmap_segmentation_challenge/utils/eval_utils/aggregation.py | Aggregates PQ with macro-averaged per-crop PQ per label + overall unweighted means. |
| src/cellmap_segmentation_challenge/utils/eval_utils/__init__.py | Exposes `PQCropDict` and removes `PER_INSTANCE_THREADS` export. |
| src/cellmap_segmentation_challenge/evaluate.py | Updates re-exports to match changed eval_utils surface. |
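For readers unfamiliar with the greedy IoU-threshold matching mentioned for `match_instances_pq`, here is a rough, dependency-free sketch of the general technique. It is not the repository's implementation: the function name `greedy_pq_match`, the flat label-sequence inputs, the treatment of label 0 as background, and the strict `>` comparison against the threshold are all assumptions.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Sequence


def greedy_pq_match(
    gt: Sequence[int], pred: Sequence[int], iou_threshold: float = 0.5
) -> tuple[int, int, int, float]:
    """Return (tp, fp, fn, sum_iou) for two flattened label volumes."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same number of voxels")

    # Voxel counts per instance, ignoring background (label 0).
    gt_sizes = Counter(g for g in gt if g != 0)
    pred_sizes = Counter(p for p in pred if p != 0)
    # Voxel counts for every overlapping (gt, pred) instance pair.
    overlaps = Counter((g, p) for g, p in zip(gt, pred) if g != 0 and p != 0)

    candidates = []
    for (g, p), inter in overlaps.items():
        iou = inter / (gt_sizes[g] + pred_sizes[p] - inter)
        if iou > iou_threshold:
            candidates.append((iou, g, p))

    # Greedy pass: accept the highest-IoU pairs first, each instance at most once.
    candidates.sort(reverse=True)
    matched_gt: set[int] = set()
    matched_pred: set[int] = set()
    sum_iou = 0.0
    for iou, g, p in candidates:
        if g in matched_gt or p in matched_pred:
            continue
        matched_gt.add(g)
        matched_pred.add(p)
        sum_iou += iou

    tp = len(matched_gt)
    return tp, len(pred_sizes) - tp, len(gt_sizes) - tp, sum_iou


result = greedy_pq_match([1, 1, 1, 1, 0, 0, 2, 2], [5, 5, 5, 0, 0, 0, 0, 6])
```

In this toy input, ground-truth instance 1 overlaps prediction 5 with IoU 0.75 and is matched, while instance 2 overlaps prediction 6 with IoU exactly 0.5, which does not pass the strict comparison and so counts as one false negative plus one false positive.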

Comments suppressed due to low confidence (1)

src/cellmap_segmentation_challenge/utils/eval_utils/types.py:32

With the shift to PQ metrics, score_instance / score_semantic no longer return the fields described by InstanceScoreDict and SemanticScoreDict (they now return PQ accumulators like tp/fp/fn/sum_iou). Right now those TypedDicts remain exported, which is misleading for API consumers and type checking. Consider updating/deprecating InstanceScoreDict / SemanticScoreDict and annotating the PQ-returning functions with PQCropDict (or introducing dedicated PQ TypedDicts for instance vs semantic if needed).

class PQCropDict(TypedDict, total=False):
    """PQ accumulators returned per (crop, label) by score_label.

    Raw accumulators (tp/fp/fn/sum_iou) are summed globally across crops in
    ``combine_scores`` before per-category PQ/SQ/RQ values are derived.
    The derived ``pq`` and ``f1`` fields are computed per-crop for
    interpretability and are NOT used in the global aggregation.
    """

    tp: int
    fp: int
    fn: int
    sum_iou: float
    pq: float  # per-crop PQ = sum_iou / (tp + 0.5*fp + 0.5*fn)
    sq: float  # per-crop SQ = sum_iou / tp (mean IoU of matched pairs; 0 if tp=0)
    f1: float  # per-crop F1/RQ = 2*tp / (2*tp + fp + fn)
    num_voxels: int
    voxel_size: tuple[float, ...]
    is_missing: bool
    status: Literal["scored", "skipped_too_many_instances", "missing"]


class InstanceScoreDict(TypedDict, total=False):
    """Type definition for instance segmentation scores."""

    mean_accuracy: float
    binary_accuracy: float
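The formulas in the `PQCropDict` field comments can be checked with a small sketch like the one below. The helper name `derive_pq` is hypothetical, and returning 0.0 for PQ and F1 when there are no instances at all is an assumption; the docstring only states the zero fallback for SQ.

```python
from __future__ import annotations


def derive_pq(tp: int, fp: int, fn: int, sum_iou: float) -> dict[str, float]:
    """Derive per-crop PQ, SQ and F1 (RQ) from raw PQ accumulators."""
    pq_denom = tp + 0.5 * fp + 0.5 * fn
    rq_denom = 2 * tp + fp + fn
    return {
        # PQ = sum_iou / (tp + 0.5*fp + 0.5*fn)
        "pq": sum_iou / pq_denom if pq_denom > 0 else 0.0,
        # SQ = mean IoU over matched pairs; 0 if nothing matched
        "sq": sum_iou / tp if tp > 0 else 0.0,
        # F1/RQ = 2*tp / (2*tp + fp + fn)
        "f1": 2 * tp / rq_denom if rq_denom > 0 else 0.0,
    }


example = derive_pq(tp=3, fp=1, fn=1, sum_iou=2.4)
```

For these accumulators the result is roughly PQ 0.6, SQ 0.8 and F1 0.75, which also illustrates the identity PQ = SQ × RQ.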

src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py

src/cellmap_segmentation_challenge/utils/eval_utils/aggregation.py

src/cellmap_segmentation_challenge/utils/eval_utils/submission.py

src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py

src/cellmap_segmentation_challenge/utils/eval_utils/submission.py

Copilot · 2026-03-24T18:34:56Z

@rhoadesScholar I've opened a new pull request, #201, to work on those changes. Once the pull request is ready, I'll request review from you.

…ng on re-call Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com> Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-segmentation-challenge/sessions/4757d3c5-3419-49df-8a4f-c4d01db80ffd

Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com> Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-segmentation-challenge/sessions/10b4e7aa-4559-4c15-86c3-4006a0e9f441

fix: prevent combine_scores from re-consuming aggregation keys on repeated calls

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/cellmap_segmentation_challenge/evaluate.py:30

evaluate.py is described as re-exporting public APIs for backward compatibility, but it no longer re-exports the previously-public InstanceScoreDict / SemanticScoreDict types (and PER_INSTANCE_THREADS was also dropped). If external users import these symbols from cellmap_segmentation_challenge.evaluate, this is a breaking change. Consider keeping the old names as deprecated aliases (e.g., aliasing to the new PQCropDict / keeping PER_INSTANCE_THREADS), or document the API break explicitly.

# Re-export all public APIs for backward compatibility
from .utils.eval_utils import (
    # Exceptions
    EvaluationError,
    TooManyInstancesError,
    TooManyOverlapEdgesError,
    MatchingFailedError,
    ValidationError,
    # Configuration
    EvaluationConfig,
    CAST_TO_NONE,
    MAX_INSTANCE_THREADS,
    MAX_SEMANTIC_THREADS,
    MAX_DISTANCE_CAP_EPS,
    FINAL_INSTANCE_RATIO_CUTOFF,
    INITIAL_INSTANCE_RATIO_CUTOFF,
    INSTANCE_RATIO_FACTOR,
    MAX_OVERLAP_EDGES,
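One possible way to keep the dropped names importable, as the comment above suggests, is a module-level `__getattr__` (PEP 562) that returns the new type and emits a `DeprecationWarning`. The sketch below demonstrates the mechanism on a throwaway module object; `make_deprecated_getattr`, the `evaluate_demo` module, and the stand-in `PQCropDict` class are hypothetical and not part of the package.

```python
from __future__ import annotations

import types
import warnings
from typing import Any


def make_deprecated_getattr(module_name: str, aliases: dict[str, Any]):
    """Build a PEP 562 module __getattr__ that serves deprecated aliases."""

    def __getattr__(name: str) -> Any:
        if name in aliases:
            warnings.warn(
                f"{module_name}.{name} is deprecated",
                DeprecationWarning,
                stacklevel=2,
            )
            return aliases[name]
        raise AttributeError(f"module {module_name!r} has no attribute {name!r}")

    return __getattr__


class PQCropDict(dict):
    """Stand-in for the real TypedDict, used only in this demo."""


# Demo on a throwaway module rather than the real evaluate.py.
demo = types.ModuleType("evaluate_demo")
demo.__getattr__ = make_deprecated_getattr(
    "evaluate_demo",
    {"InstanceScoreDict": PQCropDict, "SemanticScoreDict": PQCropDict},
)
```

In a real module the same effect comes from defining `__getattr__` at the top level of the file, so names that are still exported normally never trigger the warning.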

src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

rhoadesScholar added 3 commits March 22, 2026 20:57

feat: enhance PQ computation by implementing macro-averaging across c…

1c127b1

…rops

feat: add per-crop PQ, SQ, and F1 metrics to scoring functions

8ca3a1a

Copilot AI review requested due to automatic review settings March 24, 2026 18:09

Copilot started reviewing on behalf of rhoadesScholar March 24, 2026 18:10

Copilot AI reviewed Mar 24, 2026

Copilot started work on behalf of rhoadesScholar March 24, 2026 18:33

rhoadesScholar and others added 2 commits March 24, 2026 14:33

feat: add instance ratio check in match_instances_pq function

2ee63f7

Initial plan

529b1cc

Copilot AI mentioned this pull request Mar 24, 2026

fix: prevent combine_scores from re-consuming aggregation keys on repeated calls #201

Merged

rhoadesScholar and others added 3 commits March 24, 2026 14:37

Merge branch 'main' into pq_eval

6130157

Hard-code iou_threshold=0.5 in match_instances_pq; remove parameter

502e5dc

Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com> Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-segmentation-challenge/sessions/10b4e7aa-4559-4c15-86c3-4006a0e9f441

Copilot finished work on behalf of rhoadesScholar March 24, 2026 18:42

rhoadesScholar and others added 3 commits March 24, 2026 14:45

Merge branch 'pq_eval' into copilot/sub-pr-200

5d7de47

Merge pull request #201 from janelia-cellmap/copilot/sub-pr-200

f010654

fix: prevent combine_scores from re-consuming aggregation keys on repeated calls

refactor: remove unused types and update combine_scores to use micro-…

d1cc851

…averaging

rhoadesScholar requested a review from Copilot March 24, 2026 19:18

Copilot started reviewing on behalf of rhoadesScholar March 24, 2026 19:19

Copilot AI reviewed Mar 24, 2026

src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py (resolved)

src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py (resolved)

Update src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py

899c983

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

rhoadesScholar wants to merge 12 commits into main from pq_eval

3 participants
