…ations
- Consolidated parallel scoring execution into a single process pool for improved performance.
- Updated scoring metrics to include panoptic quality (PQ), segmentation quality (SQ), and recognition quality (RQ).
- Introduced new data structures for PQ accumulators and updated related tests.
- Refactored tests to align with new scoring metrics and ensure comprehensive coverage.
- Removed outdated test file for refactored evaluate functions.
Pull request overview
This PR updates the evaluation pipeline to use Panoptic Quality (PQ)–style accumulators (tp/fp/fn/sum_iou) for both instance (“thing”) and semantic (“stuff”) scoring, and changes aggregation to compute macro-averaged PQ across crops per label.
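As a rough illustration of the metric described above, the sketch below derives PQ, SQ, and RQ from one set of `tp`/`fp`/`fn`/`sum_iou` accumulators and then macro-averages per-crop PQ within each label before taking an unweighted mean across labels. The function names `pq_from_accumulators` and `macro_average_pq` are hypothetical and not taken from the PR, and returning 0.0 when a denominator is zero is an assumption.

```python
from statistics import fmean


def pq_from_accumulators(tp, fp, fn, sum_iou):
    """Derive PQ, SQ and RQ from raw accumulators (hypothetical helper)."""
    denom = tp + 0.5 * fp + 0.5 * fn
    # Assumption: empty denominators score 0.0 rather than NaN.
    pq = sum_iou / denom if denom > 0 else 0.0
    sq = sum_iou / tp if tp > 0 else 0.0
    rq = tp / denom if denom > 0 else 0.0  # same value as 2*tp / (2*tp + fp + fn)
    return {"pq": pq, "sq": sq, "rq": rq}


def macro_average_pq(crops_by_label):
    """Average per-crop PQ within each label, then average across labels."""
    per_label = {}
    for label, crops in crops_by_label.items():
        crop_scores = [pq_from_accumulators(**crop)["pq"] for crop in crops]
        per_label[label] = fmean(crop_scores) if crop_scores else 0.0
    overall = fmean(list(per_label.values())) if per_label else 0.0
    return per_label, overall


if __name__ == "__main__":
    example = {
        "mito": [
            {"tp": 2, "fp": 1, "fn": 1, "sum_iou": 1.5},
            {"tp": 1, "fp": 0, "fn": 0, "sum_iou": 1.0},
        ],
        "er": [{"tp": 0, "fp": 1, "fn": 1, "sum_iou": 0.0}],
    }
    print(macro_average_pq(example))
```

Because each crop contributes one PQ value regardless of how many instances it contains, small crops weigh as much as large ones under this scheme.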
Changes:
- Replace instance/semantic metrics with PQ accumulator outputs and compute per-crop derived PQ/SQ/F1 fields.
- Update aggregation to produce per-label `pq`/`sq`/`rq` and overall scores as unweighted means across labels (with legacy aliases preserved).
- Simplify parallel scoring to a single `ProcessPoolExecutor` controlled by `MAX_WORKERS`, and update tests to assert PQ behavior.
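The single-pool arrangement in the last item could look roughly like the sketch below: one worker count is resolved from the environment and one `ProcessPoolExecutor` runs every scoring task. The helper names `resolve_max_workers` and `score_all` are hypothetical, and using `MAX_INSTANCE_THREADS` as the legacy fallback variable is an assumption; the PR's actual precedence rules are not shown in this conversation.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def resolve_max_workers(env=None, legacy_keys=("MAX_INSTANCE_THREADS",), default=None):
    """Pick a worker count: MAX_WORKERS first, then legacy keys, then a default.

    Hypothetical helper; the legacy key name is an assumption.
    """
    env = os.environ if env is None else env
    for key in ("MAX_WORKERS", *legacy_keys):
        raw = env.get(key)
        if raw is None:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # ignore unparsable values and try the next key
        if value > 0:
            return value
    return default if default is not None else (os.cpu_count() or 1)


def score_all(score_fn, tasks, max_workers=None):
    """Run every scoring task in one shared process pool."""
    workers = max_workers or resolve_max_workers()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # score_fn must be picklable (defined at module top level).
        return list(pool.map(score_fn, tasks))
```

Keeping a single pool avoids nesting executors, so the total number of worker processes stays bounded by one setting.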
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_evaluate_refactored.py | Removes a large refactor-focused test suite (coverage impact). |
| tests/test_evaluate_metrics.py | Updates tests to validate PQ accumulator semantics and new aggregation outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/types.py | Adds PQCropDict TypedDict for PQ crop-level outputs. |
| src/cellmap_segmentation_challenge/utils/eval_utils/submission.py | Uses a unified process pool; updates score_submission results documentation. |
| src/cellmap_segmentation_challenge/utils/eval_utils/scoring.py | Reworks score_instance/score_semantic to return PQ accumulators and per-crop derived PQ/SQ/F1. |
| src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py | Adds greedy IoU-threshold PQ matching (match_instances_pq). |
| src/cellmap_segmentation_challenge/utils/eval_utils/distance.py | Reads Hausdorff threading from legacy env var directly (PER_INSTANCE_THREADS). |
| src/cellmap_segmentation_challenge/utils/eval_utils/config.py | Introduces max_workers with env precedence and legacy fallback. |
| src/cellmap_segmentation_challenge/utils/eval_utils/aggregation.py | Aggregates PQ with macro-averaged per-crop PQ per label + overall unweighted means. |
| src/cellmap_segmentation_challenge/utils/eval_utils/__init__.py | Exposes PQCropDict and removes PER_INSTANCE_THREADS export. |
| src/cellmap_segmentation_challenge/evaluate.py | Updates re-exports to match changed eval_utils surface. |
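For readers unfamiliar with the greedy IoU-threshold matching mentioned for `instance_matching.py`, here is a minimal sketch of the general technique. It is not the PR's `match_instances_pq`: the function name `greedy_match_pq`, the sparse `pair_iou` input, and the strict `> 0.5` threshold are all assumptions chosen for illustration.

```python
def greedy_match_pq(pair_iou, n_gt, n_pred, threshold=0.5):
    """Greedily match ground-truth and predicted instances by descending IoU.

    pair_iou maps (gt_index, pred_index) -> IoU for overlapping pairs only.
    Returns PQ accumulators. Hypothetical helper for illustration.
    """
    # Keep only pairs above the threshold, best overlaps first.
    candidates = sorted(
        ((iou, gt, pred) for (gt, pred), iou in pair_iou.items() if iou > threshold),
        reverse=True,
    )
    used_gt, used_pred = set(), set()
    sum_iou = 0.0
    for iou, gt, pred in candidates:
        if gt in used_gt or pred in used_pred:
            continue  # each instance may be matched at most once
        used_gt.add(gt)
        used_pred.add(pred)
        sum_iou += iou
    tp = len(used_gt)
    return {"tp": tp, "fp": n_pred - tp, "fn": n_gt - tp, "sum_iou": sum_iou}


if __name__ == "__main__":
    overlaps = {(0, 0): 0.8, (0, 1): 0.6, (1, 1): 0.7, (2, 2): 0.3}
    print(greedy_match_pq(overlaps, n_gt=3, n_pred=3))
```

With a strict threshold of 0.5, at most one prediction can exceed the threshold for any ground-truth instance when instances do not overlap each other, so the greedy pass and an optimal assignment agree in that case; with lower thresholds they can differ.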
Comments suppressed due to low confidence (1)
src/cellmap_segmentation_challenge/utils/eval_utils/types.py:32
- With the shift to PQ metrics, `score_instance`/`score_semantic` no longer return the fields described by `InstanceScoreDict` and `SemanticScoreDict` (they now return PQ accumulators like `tp`/`fp`/`fn`/`sum_iou`). Right now those TypedDicts remain exported, which is misleading for API consumers and type checking. Consider updating/deprecating `InstanceScoreDict`/`SemanticScoreDict` and annotating the PQ-returning functions with `PQCropDict` (or introducing dedicated PQ TypedDicts for instance vs semantic if needed).
class PQCropDict(TypedDict, total=False):
    """PQ accumulators returned per (crop, label) by score_label.

    Raw accumulators (tp/fp/fn/sum_iou) are summed globally across crops in
    ``combine_scores`` before per-category PQ/SQ/RQ values are derived.
    The derived ``pq`` and ``f1`` fields are computed per-crop for
    interpretability and are NOT used in the global aggregation.
    """

    tp: int
    fp: int
    fn: int
    sum_iou: float
    pq: float  # per-crop PQ = sum_iou / (tp + 0.5*fp + 0.5*fn)
    sq: float  # per-crop SQ = sum_iou / tp (mean IoU of matched pairs; 0 if tp=0)
    f1: float  # per-crop F1/RQ = 2*tp / (2*tp + fp + fn)
    num_voxels: int
    voxel_size: tuple[float, ...]
    is_missing: bool
    status: Literal["scored", "skipped_too_many_instances", "missing"]


class InstanceScoreDict(TypedDict, total=False):
    """Type definition for instance segmentation scores."""

    mean_accuracy: float
    binary_accuracy: float
src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py (resolved)
src/cellmap_segmentation_challenge/utils/eval_utils/instance_matching.py (outdated, resolved)
@rhoadesScholar I've opened a new pull request, #201, to work on those changes. Once the pull request is ready, I'll request review from you.
…ng on re-call
Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com>
Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-segmentation-challenge/sessions/4757d3c5-3419-49df-8a4f-c4d01db80ffd
Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com>
Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-segmentation-challenge/sessions/10b4e7aa-4559-4c15-86c3-4006a0e9f441
fix: prevent combine_scores from re-consuming aggregation keys on repeated calls
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/cellmap_segmentation_challenge/evaluate.py:30
`evaluate.py` is described as re-exporting public APIs for backward compatibility, but it no longer re-exports the previously-public `InstanceScoreDict`/`SemanticScoreDict` types (and `PER_INSTANCE_THREADS` was also dropped). If external users import these symbols from `cellmap_segmentation_challenge.evaluate`, this is a breaking change. Consider keeping the old names as deprecated aliases (e.g., aliasing to the new `PQCropDict` / keeping `PER_INSTANCE_THREADS`), or document the API break explicitly.
# Re-export all public APIs for backward compatibility
from .utils.eval_utils import (
    # Exceptions
    EvaluationError,
    TooManyInstancesError,
    TooManyOverlapEdgesError,
    MatchingFailedError,
    ValidationError,
    # Configuration
    EvaluationConfig,
    CAST_TO_NONE,
    MAX_INSTANCE_THREADS,
    MAX_SEMANTIC_THREADS,
    MAX_DISTANCE_CAP_EPS,
    FINAL_INSTANCE_RATIO_CUTOFF,
    INITIAL_INSTANCE_RATIO_CUTOFF,
    INSTANCE_RATIO_FACTOR,
    MAX_OVERLAP_EDGES,
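One way to act on the deprecated-alias suggestion in this comment is a module-level `__getattr__` (PEP 562), which resolves old names lazily and emits a `DeprecationWarning`. The sketch below is only an illustration under assumptions: it builds a stand-in module object named `compat_demo` so it can run on its own, and the trimmed `PQCropDict`, the alias table, and the warning text are hypothetical. In a real module the hook would be a top-level `def __getattr__(name)`.

```python
import types
import warnings
from typing import TypedDict


class PQCropDict(TypedDict, total=False):
    # Trimmed stand-in for the PR's PQCropDict.
    tp: int
    fp: int
    fn: int
    sum_iou: float


# Old public names mapped to their replacements (assumed mapping).
_DEPRECATED_ALIASES = {
    "InstanceScoreDict": PQCropDict,
    "SemanticScoreDict": PQCropDict,
}


def _deprecated_getattr(name):
    """Resolve deprecated names lazily, warning on each access."""
    if name in _DEPRECATED_ALIASES:
        warnings.warn(
            f"{name} is deprecated; use PQCropDict instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED_ALIASES[name]
    raise AttributeError(f"module has no attribute {name!r}")


# Stand-in module so the example is self-contained.
compat = types.ModuleType("compat_demo")
compat.PQCropDict = PQCropDict
compat.__getattr__ = _deprecated_getattr

if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        alias = compat.InstanceScoreDict
    print(alias is PQCropDict, [str(w.message) for w in caught])
```

The hook only fires for names that are missing from the module, so regular exports pay no extra cost.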
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>