Pq eval #200

Open
rhoadesScholar wants to merge 12 commits into main from pq_eval

@rhoadesScholar
Member

No description provided.

…ations

- Consolidated parallel scoring execution into a single process pool for improved performance.
- Updated scoring metrics to include panoptic quality (PQ), segmentation quality (SQ), and recognition quality (RQ).
- Introduced new data structures for PQ accumulators and updated related tests.
- Refactored tests to align with new scoring metrics and ensure comprehensive coverage.
- Removed outdated test file for refactored evaluate functions.
Copilot AI review requested due to automatic review settings March 24, 2026 18:09
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the evaluation pipeline to use Panoptic Quality (PQ)–style accumulators (tp/fp/fn/sum_iou) for both instance (“thing”) and semantic (“stuff”) scoring, and changes aggregation to compute macro-averaged PQ across crops per label.
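
The macro-averaging described above can be sketched as follows. This is only an illustration of the idea, not the PR's `aggregation.py`: the function name `macro_average_pq` and the nested `{crop: {label: scores}}` input layout are assumptions, and each entry is assumed to already carry a per-crop `pq` value.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import fmean


def macro_average_pq(crop_scores: dict[str, dict[str, dict]]) -> dict:
    """Average per-crop PQ within each label, then average the label means.

    Hypothetical helper: `crop_scores` maps crop name -> label -> score dict
    that contains a per-crop "pq" value.
    """
    per_label: dict[str, list[float]] = defaultdict(list)
    for labels in crop_scores.values():
        for label, scores in labels.items():
            per_label[label].append(scores["pq"])

    # Every crop counts equally within a label (macro average across crops).
    label_pq = {label: fmean(values) for label, values in per_label.items()}
    # Every label counts equally in the overall score (unweighted mean).
    overall = fmean(label_pq.values()) if label_pq else 0.0
    return {"label_pq": label_pq, "overall_pq": overall}


example = {
    "crop1": {"mito": {"pq": 0.8}, "er": {"pq": 0.25}},
    "crop2": {"mito": {"pq": 0.4}},
}
summary = macro_average_pq(example)
```

Under this scheme a small crop influences a label's score as much as a large one, which is the practical difference from summing accumulators globally before deriving PQ.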

Changes:

  • Replace instance/semantic metrics with PQ accumulator outputs and compute per-crop derived PQ/SQ/F1 fields.
  • Update aggregation to produce per-label pq/sq/rq and overall scores as unweighted means across labels (with legacy aliases preserved).
  • Simplify parallel scoring to a single ProcessPoolExecutor controlled by MAX_WORKERS, and update tests to assert PQ behavior.
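
A minimal sketch of how a single worker count might be resolved from the environment, with `MAX_WORKERS` taking precedence over older settings. The function name `resolve_max_workers`, the choice of `MAX_INSTANCE_THREADS` and `MAX_SEMANTIC_THREADS` as the legacy variables, and taking the larger legacy value are all assumptions made for illustration; the PR's `config.py` may resolve these differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_max_workers(
    env: Mapping[str, str] | None = None,
    legacy_vars: tuple[str, ...] = ("MAX_INSTANCE_THREADS", "MAX_SEMANTIC_THREADS"),
) -> int:
    """Return the pool size: MAX_WORKERS first, then legacy variables, then CPU count."""
    env = os.environ if env is None else env

    # 1. The new variable wins whenever it is set.
    raw = env.get("MAX_WORKERS")
    if raw is not None:
        return max(1, int(raw))

    # 2. Fall back to legacy variables (assumed names); use the largest value found.
    legacy = [int(env[name]) for name in legacy_vars if name in env]
    if legacy:
        return max(1, max(legacy))

    # 3. Default to the number of CPUs, never less than one worker.
    return os.cpu_count() or 1
```

The resolved value could then be passed as `max_workers` to one shared `concurrent.futures.ProcessPoolExecutor` instead of sizing separate pools per metric type.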

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_evaluate_refactored.py | Removes a large refactor-focused test suite (coverage impact). |
| tests/test_evaluate_metrics.py | Updates tests to validate PQ accumulator semantics and new aggregation outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/types.py | Adds PQCropDict TypedDict for PQ crop-level outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/submission.py | Uses a unified process pool; updates score_submission results documentation. |
| src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py | Reworks score_instance/score_semantic to return PQ accumulators and per-crop derived PQ/SQ/F1. |
| src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py | Adds greedy IoU-threshold PQ matching (match_instances_pq). |
| src/cellmap_segmentation_challenge/utils/eval_utils/distance.py | Reads Hausdorff threading from legacy env var directly (PER_INSTANCE_THREADS). |
| src/cellmap_segmentation_challenge/utils/eval_utils/config.py | Introduces max_workers with env precedence and legacy fallback. |
| src/cellmap_segmentation_challenge/utils/eval_utils/aggregation.py | Aggregates PQ with macro-averaged per-crop PQ per label + overall unweighted means. |
| src/cellmap_segmentation_challenge/utils/eval_utils/__init__.py | Exposes PQCropDict and removes PER_INSTANCE_THREADS export. |
| src/cellmap_segmentation_challenge/evaluate.py | Updates re-exports to match changed eval_utils surface. |
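
For readers unfamiliar with greedy IoU-threshold matching, the sketch below shows one way such a matcher can produce the tp/fp/fn/sum_iou accumulators. It is not the PR's `match_instances_pq`: the name `greedy_pq_match`, the precomputed `ious` mapping, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
from __future__ import annotations


def greedy_pq_match(
    ious: dict[tuple[int, int], float],
    n_gt: int,
    n_pred: int,
    threshold: float = 0.5,
) -> dict:
    """Greedily pair ground-truth and predicted instances by descending IoU.

    `ious` maps (gt_id, pred_id) -> IoU for overlapping pairs only.
    A pair counts as a true positive when its IoU exceeds `threshold`
    and neither instance has been matched already.
    """
    matched_gt: set[int] = set()
    matched_pred: set[int] = set()
    sum_iou = 0.0

    # Visit candidate pairs from best to worst overlap.
    for (gt_id, pred_id), iou in sorted(ious.items(), key=lambda kv: kv[1], reverse=True):
        if iou <= threshold:
            break  # all remaining pairs are below the threshold
        if gt_id in matched_gt or pred_id in matched_pred:
            continue  # each instance can be matched at most once
        matched_gt.add(gt_id)
        matched_pred.add(pred_id)
        sum_iou += iou

    tp = len(matched_gt)
    # Unmatched predictions are false positives; unmatched ground truth are false negatives.
    return {"tp": tp, "fp": n_pred - tp, "fn": n_gt - tp, "sum_iou": sum_iou}


acc = greedy_pq_match({(0, 0): 0.9, (1, 1): 0.55, (1, 2): 0.2, (2, 2): 0.3}, n_gt=3, n_pred=3)
```

With a threshold of 0.5 or higher, two segments with IoU above the threshold cannot both claim the same partner, so the greedy order does not change the result; it only matters if a lower threshold is used.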
Comments suppressed due to low confidence (1)

src/cellmap_segmentation_challenge/utils/eval_utils/types.py:32

  • With the shift to PQ metrics, score_instance / score_semantic no longer return the fields described by InstanceScoreDict and SemanticScoreDict (they now return PQ accumulators like tp/fp/fn/sum_iou). Right now those TypedDicts remain exported, which is misleading for API consumers and type checking. Consider updating/deprecating InstanceScoreDict / SemanticScoreDict and annotating the PQ-returning functions with PQCropDict (or introducing dedicated PQ TypedDicts for instance vs semantic if needed).
class PQCropDict(TypedDict, total=False):
    """PQ accumulators returned per (crop, label) by score_label.

    Raw accumulators (tp/fp/fn/sum_iou) are summed globally across crops in
    ``combine_scores`` before per-category PQ/SQ/RQ values are derived.
    The derived ``pq`` and ``f1`` fields are computed per-crop for
    interpretability and are NOT used in the global aggregation.
    """

    tp: int
    fp: int
    fn: int
    sum_iou: float
    pq: float  # per-crop PQ = sum_iou / (tp + 0.5*fp + 0.5*fn)
    sq: float  # per-crop SQ = sum_iou / tp (mean IoU of matched pairs; 0 if tp=0)
    f1: float  # per-crop F1/RQ = 2*tp / (2*tp + fp + fn)
    num_voxels: int
    voxel_size: tuple[float, ...]
    is_missing: bool
    status: Literal["scored", "skipped_too_many_instances", "missing"]


class InstanceScoreDict(TypedDict, total=False):
    """Type definition for instance segmentation scores."""

    mean_accuracy: float
    binary_accuracy: float
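
The formulas given in the `PQCropDict` field comments above can be written out as a small helper. The function name `derive_pq_fields` is hypothetical, and returning 0.0 when a denominator is zero extends the "0 if tp=0" note on `sq` to the other two fields as an assumption.

```python
from __future__ import annotations


def derive_pq_fields(tp: int, fp: int, fn: int, sum_iou: float) -> dict[str, float]:
    """Derive per-crop PQ, SQ and F1 (RQ) from raw accumulators."""
    # PQ = sum_iou / (tp + 0.5*fp + 0.5*fn)
    pq_denom = tp + 0.5 * fp + 0.5 * fn
    pq = sum_iou / pq_denom if pq_denom > 0 else 0.0

    # SQ = mean IoU of matched pairs; 0 if there are no matches
    sq = sum_iou / tp if tp > 0 else 0.0

    # F1 / RQ = 2*tp / (2*tp + fp + fn)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom > 0 else 0.0

    return {"pq": pq, "sq": sq, "f1": f1}


fields = derive_pq_fields(tp=3, fp=1, fn=1, sum_iou=2.4)
```

Because the F1 denominator is twice the PQ denominator, PQ always equals SQ × F1, which makes a convenient sanity check in tests.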


Contributor

Copilot AI commented Mar 24, 2026

@rhoadesScholar I've opened a new pull request, #201, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/cellmap_segmentation_challenge/evaluate.py:30

  • evaluate.py is described as re-exporting public APIs for backward compatibility, but it no longer re-exports the previously-public InstanceScoreDict / SemanticScoreDict types (and PER_INSTANCE_THREADS was also dropped). If external users import these symbols from cellmap_segmentation_challenge.evaluate, this is a breaking change. Consider keeping the old names as deprecated aliases (e.g., aliasing to the new PQCropDict / keeping PER_INSTANCE_THREADS), or document the API break explicitly.
# Re-export all public APIs for backward compatibility
from .utils.eval_utils import (
    # Exceptions
    EvaluationError,
    TooManyInstancesError,
    TooManyOverlapEdgesError,
    MatchingFailedError,
    ValidationError,
    # Configuration
    EvaluationConfig,
    CAST_TO_NONE,
    MAX_INSTANCE_THREADS,
    MAX_SEMANTIC_THREADS,
    MAX_DISTANCE_CAP_EPS,
    FINAL_INSTANCE_RATIO_CUTOFF,
    INITIAL_INSTANCE_RATIO_CUTOFF,
    INSTANCE_RATIO_FACTOR,
    MAX_OVERLAP_EDGES,
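
One way to keep the dropped names importable, as the comment above suggests, is a module-level `__getattr__` (PEP 562) that returns the replacement and emits a `DeprecationWarning`. The sketch below is an assumption about how that could look, not code from this PR: `make_deprecated_getattr` is a hypothetical helper, and the `PQCropDict` class here is a stand-in so the example is self-contained.

```python
from __future__ import annotations

import warnings
from typing import Any, Callable


def make_deprecated_getattr(
    module_name: str, aliases: dict[str, tuple[str, Any]]
) -> Callable[[str], Any]:
    """Build a PEP 562 module `__getattr__` that serves deprecated aliases.

    `aliases` maps old name -> (replacement name, replacement object).
    """

    def _getattr(name: str) -> Any:
        if name in aliases:
            new_name, value = aliases[name]
            warnings.warn(
                f"{module_name}.{name} is deprecated; use {new_name} instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return value
        # Unknown names behave exactly like a normal missing attribute.
        raise AttributeError(f"module {module_name!r} has no attribute {name!r}")

    return _getattr


class PQCropDict(dict):
    """Stand-in for the real TypedDict, used only to make this sketch runnable."""


# In a real module this callable would be bound to the module-level
# name `__getattr__` so that `from module import InstanceScoreDict` keeps working.
legacy_getattr = make_deprecated_getattr(
    "evaluate",
    {
        "InstanceScoreDict": ("PQCropDict", PQCropDict),
        "SemanticScoreDict": ("PQCropDict", PQCropDict),
    },
)
```

Whether aliasing the old TypedDicts to `PQCropDict` is appropriate depends on how different the field sets are; if callers rely on the old keys, documenting the break explicitly may be the more honest option.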


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
3 participants