Questions about evaluation protocol, prediction normalization, and data augmentation in the released training pipeline

Hi, thank you for releasing the SCSegamba codebase. I am trying to reproduce and understand the reported results, but I found several points in the released implementation that seem unclear or potentially inconsistent with standard binary segmentation evaluation. I would appreciate clarification.

## 1. Prediction output is saved after per-image max normalization instead of sigmoid

In the testing/evaluation pipeline, the model output appears to be saved as:

```python
out = out[0, 0, ...].cpu().numpy()
out = 255 * (out / np.max(out))
cv2.imwrite(..., out)
```

However, during training the model uses `BCEWithLogitsLoss`, which suggests that the model output should be treated as logits. In a standard binary segmentation pipeline, the evaluation would usually be:

```python
prob = torch.sigmoid(logits)  # logits: raw model output
pred = prob > threshold
```

or equivalently, for threshold 0.5:

```python
pred = logits > 0
```

The current per-image normalization changes the meaning of the threshold. For example, if two pixels have logits 8 and 20, both correspond to sigmoid probabilities close to 1.0. But after `out / max(out)`, the logit 8 pixel becomes 0.4 if the maximum logit in that image is 20. Therefore, thresholding the normalized map is no longer equivalent to thresholding probabilities.
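A minimal NumPy sketch of the mismatch described above. The logit values and the 0.5 threshold are made up for illustration and are not taken from the repository:

```python
import numpy as np

# Hypothetical logits for four pixels in one image (illustrative values only).
logits = np.array([-3.0, 0.5, 8.0, 20.0])

# Standard route: sigmoid, then threshold the probability at 0.5.
prob = 1.0 / (1.0 + np.exp(-logits))
pred_sigmoid = prob > 0.5

# Route matching the saved output: divide by the per-image maximum.
normalized = logits / np.max(logits)
pred_maxnorm = normalized > 0.5

print(pred_sigmoid)  # [False  True  True  True]
print(pred_maxnorm)  # [False False False  True]
```

The two routes disagree on the pixels with logits 0.5 and 8, even though both have sigmoid probability above 0.5.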

Could you clarify why the raw logits are normalized by the maximum value of each image instead of applying sigmoid?

## 2. Precision, Recall, and F1 seem to be reported at threshold = 0.0

In `evaluate.py`, `cal_prf_metrics()` sweeps thresholds from 0.0 up to, but not including, 1.0:

```python
for thresh in np.arange(0.0, 1.0, thresh_step):
    pred_img = (pred / 255 > thresh).astype('uint8')
    ...
```

However, in `eval()`, the reported values are:

```python
F1 = F_list[0]
Precision = Precision_list[0]
Recall = Recall_list[0]
```

This means that the reported Precision/Recall/F1 correspond to `threshold = 0.0`, not `threshold = 0.5`, and not the best F1 threshold.
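To show how much the choice of index can matter, here is a small sketch on toy data. The helper `prf_at`, the pixel values, and the step of 0.01 are my own assumptions for illustration; this is not the repository's `cal_prf_metrics()`:

```python
import numpy as np

def prf_at(pred_u8, gt, thresh):
    """Precision, recall, F1 for one image at one threshold on pred / 255."""
    pred = pred_u8 / 255 > thresh
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    # Zero-denominator fallbacks are an assumption of this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

# Toy 8-bit prediction and ground truth (illustrative values only).
pred_u8 = np.array([0, 10, 60, 200, 255], dtype=np.uint8)
gt = np.array([0, 0, 0, 1, 1], dtype=np.uint8)

thresholds = np.arange(0.0, 1.0, 0.01)  # assumed step size
results = [prf_at(pred_u8, gt, t) for t in thresholds]

f1_first = results[0][2]               # threshold 0.0, analogous to F_list[0]
f1_half = prf_at(pred_u8, gt, 0.5)[2]  # fixed threshold 0.5
f1_best = max(r[2] for r in results)   # best threshold over the sweep

print(round(f1_first, 3), f1_half, f1_best)  # 0.667 1.0 1.0
```

At threshold 0.0 every nonzero pixel counts as a positive, so on this toy example F1 is lower than at 0.5 or at the best threshold.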

Could you confirm whether this is intended? If so, should the reported F1 be interpreted as the result of `pred / 255 > 0.0` rather than a standard probability threshold?

## 3. mIoU is computed using threshold sweeping and includes the background class

The mIoU implementation appears to compute:

```python
IoU_crack = TP / (TP + FP + FN)
IoU_bg = TN / (TN + FP + FN)
mIoU = (IoU_crack + IoU_bg) / 2
```

and then takes the maximum value over all thresholds:

```python
mIoU = np.max(final_iou)
```

Therefore, the reported mIoU seems to be:

```text
max_threshold mean_image((IoU_crack + IoU_background) / 2)
```

rather than foreground/crack IoU at a fixed threshold.
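As a sketch of how far the two quantities can differ on a class-imbalanced image, the following applies the formulas quoted above to a toy example. The function name, the toy arrays, and the zero-denominator fallback are assumptions of mine, not repository code:

```python
import numpy as np

def crack_and_two_class_iou(pred, gt):
    """Return (crack IoU, two-class mIoU) for one binary prediction."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    # Fallback to 0.0 on an empty denominator is an assumption of this sketch.
    iou_crack = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    iou_bg = tn / (tn + fp + fn) if tn + fp + fn else 0.0
    return float(iou_crack), float((iou_crack + iou_bg) / 2)

# Toy 10-pixel image: 2 crack pixels, 8 background pixels.
gt = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
pred = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

iou_crack, miou = crack_and_two_class_iou(pred, gt)
print(round(iou_crack, 3), round(miou, 3))  # 0.333 0.556
```

Because background pixels dominate, the two-class mIoU is noticeably higher than the crack IoU for the same prediction.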

Could you clarify whether the reported mIoU in the paper/table is the two-class mIoU including background, and whether it is selected using the best threshold on the evaluation set?

## 4. Best checkpoint appears to be selected by mIoU, not F1

From the training pipeline, the best checkpoint seems to be saved based on:

```python
if max_mIoU < metrics['mIoU']:
    ...  # saves checkpoint_best.pth
```

So the selected best model is based on the best-threshold two-class mIoU, not F1.

Could you confirm whether the reported best model is selected by mIoU or by F1?

## 5. Data augmentation does not seem to be used in the actual training dataset

I noticed that there are augmentation-related utility functions in the repository, but in the actual `CrackDataset.__getitem__()` pipeline, the data processing seems to consist mainly of:

* image/mask loading
* resizing
* mask thresholding
* ToTensor
* normalization

I could not find random crop, random flip, random rotation, random affine, or other online augmentation being applied in the actual training path.
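For reference, this is the kind of online augmentation I was looking for in the training path: a transform applied identically to the image and its mask. The helper below is hypothetical and written only to illustrate the idea; it does not exist in the repository:

```python
import numpy as np

def random_hflip_pair(image, mask, rng, p=0.5):
    """Flip image and mask together along the width axis with probability p."""
    if rng.random() < p:
        image = image[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    return image, mask

rng = np.random.default_rng(0)
image = np.arange(12).reshape(3, 4)       # toy H x W image
mask = (image % 4 == 0).astype(np.uint8)  # toy mask aligned with the image

# p=1.0 forces the flip so the effect is visible.
aug_image, aug_mask = random_hflip_pair(image, mask, rng, p=1.0)
```

The mask stays aligned with the image after the flip, which is the property any paired augmentation in `CrackDataset.__getitem__()` would need.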

Could you clarify whether online data augmentation is used in the released training code? If not, were the reported results obtained using only resizing and normalization, or was offline augmentation applied before training?

## Summary

To make the evaluation protocol easier to reproduce, it would be helpful to clarify the following:

1. Whether predictions should be evaluated from raw logits, sigmoid probabilities, or per-image max-normalized maps.
2. Whether reported Precision/Recall/F1 use threshold 0.0, threshold 0.5, or best threshold.
3. Whether reported mIoU includes background and whether it is selected by threshold sweeping.
4. Whether the best checkpoint is selected by mIoU or F1.
5. Whether online data augmentation is actually enabled in the released training pipeline.

Thank you.

